PREFACE


THIS TECHNICAL REPORT surveys Artificial Intelligence research in the area
of learning and inductive inference. It was written as Chapter XIV of Volume
III of the Handbook of Artificial Intelligence. Since AI learning research is still
in its infancy, this chapter does not present many well-understood research
results. Instead, we have attempted to provide a framework for viewing past
research and a list of open problems for future research.

This survey is necessarily incomplete, and we apologize to those researchers
whose work is not mentioned. In choosing which systems to include,
we considered several different criteria, such as historical importance (e.g.,
Samuel, Waterman, Winston), performance (e.g., CLS/ID3, Meta-DENDRAL,
Samuel), relevance to outstanding problems (e.g., LEX), and demonstration of
unusual techniques (e.g., Lenat, Dietterich and Michalski, Langley). We
attempted to select at least one representative program from each of the various
learning methods and learning situations. In some cases, we have also taken
liberties in recasting the terminology and representation of a system in order
to improve the uniformity of the chapter (e.g., Hayes-Roth, Sussman).

This chapter was a group effort. Bob London helped to outline the chapter
and wrote the articles on rote learning and advice-taking. Kenneth Clarkson
contributed the article on grammatical inference, and Geoff Dromey wrote the
article on adaptive learning. The remainder of the chapter was written by
Tom Dietterich. Valuable criticisms were provided by our reviewers: James S.
Bennett, Bruce G. Buchanan, Ryszard S. Michalski, Thomas M. Mitchell, Jack
Mostow, David Shur, and Paul Utgoff. In addition, the volume editor, Paul
R. Cohen, and the professional editor, Dianne Kanerva, helped immensely to
improve the form and content of the chapter. Thanks also to Jose L. Gonzalez
for assisting in the production of this technical report.

We hope that this chapter will serve both as a useful reference for students
of learning and as a technical contribution to AI learning research.


Tom Dietterich, chapter editor


A. OVERVIEW


LEARNING is a very general term denoting the way in which people (and
computers) increase their knowledge and improve their skills. From the very
beginnings of AI, researchers have sought to understand the process of learning
and to create computer programs that can learn.

There are two fundamental reasons for studying learning. One is to
understand the process itself. By developing computer models of learning,
psychologists have attempted to gain an understanding of the way humans
learn. Philosophers since Plato have also been interested in learning research,
because it may help them understand what knowledge is and how it grows.

The second reason for conducting learning research is to provide computers
with the ability to learn. It has long been a goal of AI to develop
computer systems that could be taught rather than programmed. Many other
applications of computers, such as intelligent programs for assisting scientists,
involve the acquisition of new knowledge. Thus, learning research has potential
for extending the range of problems to which computers can be applied.

In this overview article, we first present a short history of AI research on
learning. This is followed by a review of AI perspectives on learning, from
which a simple model of learning is developed. This model allows us to discuss
some of the major factors affecting the design of learning systems.

A Brief History of AI Research on Learning

AI research on learning has evolved through three stages. The first,
and most optimistic, stage of work centered on self-organizing systems that
modified themselves to adapt to their environments (see Yovits, Jacobi, and
Goldstein, 1962). The hope was that if a system were given a set of stimuli,
a source of feedback, and enough degrees of freedom to modify its own
organization, it would adapt itself toward an optimum organization. Attempts
were made, for example, to simulate evolution in the hope that intelligent
programs would result from the processes of random mutation and natural
selection (Friedberg, 1958; Friedberg, Dunham, and North, 1959; Fogel, Owens,
and Walsh, 1966). Various computational analogues of neurons were developed
and tested; foremost of these was the perceptron (Rosenblatt, 1957).
Unfortunately, most of these attempts failed to produce systems of any
complexity or intelligence (see Article XIV.D2 on adaptive learning).

Theoretical limitations were discovered that dampened the optimism of
these early AI researchers (see Minsky and Papert, 1969). In the 1960s,
attention moved away from learning toward knowledge-based problem solving and
natural-language understanding (Minsky, 1968). Those people who continued
to work with adaptive systems ceased to consider themselves AI researchers;
their research branched off to become a subarea of linear systems theory.
Adaptive-systems techniques are presently applied to problems in pattern
recognition and control theory.

The beginning of the 1970s saw a renewal of interest in learning with
the publication of Winston's (1970) influential thesis. In this second stage of
learning research, workers adopted the view that learning is a complex and
difficult process and that, consequently, a learning system cannot be expected
to learn high-level concepts by starting without any knowledge at all. This
view has led researchers, on the one hand, to study simple learning problems
in depth (such as learning single concepts) and, on the other, to incorporate
large amounts of domain knowledge into learning systems (such as the
Meta-DENDRAL and AM programs discussed in Articles XIV.D4b and XIV.D4c) so
that they could discover high-level concepts.

A  third  stage  of  learning  research,  motivated  by  the  need  to  acquire 
knowledge  for  expert  systems,  is  now  under  way.  Unlike  the  first  two  phases  of 
learning  research,  which  focused  on  rote  learning  and  learning  from  examples, 
the  current  work  looks  at  all  forms  of  learning,  including  advice-taking  and 
learning  from  analogies. 

Four Perspectives on Learning

Herbert Simon (in press) defines learning as any process by which a
system improves its performance. His definition assumes that the system has
a task that it is attempting to perform. It may improve its performance by
applying new methods and knowledge or by improving existing methods and
knowledge to make them faster, more accurate, or more robust.

A more constrained view of learning, adopted by many people who work
on expert systems, is that learning is the acquisition of explicit knowledge.
Many expert systems represent their expertise as large collections of rules
that need to be acquired, organized, and extended. This view emphasizes
the importance of making the acquired knowledge explicit, so that it can be
easily verified, modified, and explained. Researchers are presently working
on knowledge-acquisition systems that discover new rules from examples or
accept new rules from experts and integrate them into the knowledge base of
the system.

A third view is that learning is skill acquisition. Psychologists have
pointed out that long after people are told how to do a task, such as touch
typing or computer programming, their performance on that task continues
to improve through practice (Norman, 1980). It appears that although people
can easily understand verbal instructions on how to perform a task, much
work remains to be done to turn that verbal knowledge into efficient mental or
muscular operations. Researchers in AI and cognitive psychology have sought
to understand the kinds of knowledge that are needed to perform skillfully.
The processes by which people acquire this knowledge through practice are
little understood.

The collective enterprise of science is usually considered to be one of the
most effective ways that our culture learns about the world. Thus, a fourth
view of learning is that it is theory formation, hypothesis formation, and
inductive inference. Work on theory formation has centered on understanding
how scientists build theories to describe and explain complex phenomena. A
necessary part of theory formation is hypothesis formation, the activity of
finding one or more plausible hypotheses to explain a particular set of data
in the context of a more general theory. Another aspect of theory formation
is inductive inference, the process of inferring general laws from particular
examples.

A Simple Model of Learning and Its Implications
for the Design of Learning Systems

Of these four views of learning, Simon's (in press) is perhaps the most
encompassing. Taking his definition as a starting point, we have developed
the simple model of learning systems shown in Figure A-1. Throughout
this chapter, we use this simple model to organize our discussion of learning
systems.

In the model, the circles denote declarative bodies of information (e.g., facts
represented in predicate calculus or statements made by an expert), while the
boxes denote procedures. The arrows show the predominant direction of data
flow through the learning system. The environment supplies some information
to the learning element, the learning element uses this information to
make improvements in an explicit knowledge base, and the performance element
uses the knowledge base to perform its task. Finally, information gained
during attempts to perform the task can serve as feedback to the learning
element. This model is primitive and omits many important functions. It is
useful, however, in that it allows us to classify learning systems according to
how they fill these four functional units. In any particular application, the
environment, the knowledge base, and the performance task determine the
nature of the particular learning problem and, hence, the particular functions
that the learning element must fulfill. In the following three sections, we
examine the role of each of these three functional units that surround the
learning element.

Figure A-1. A simple model of learning systems.
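The data flow of this model can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the names KnowledgeBase, performance_element, learning_element, and run_trials, and the toy traffic-light task, are hypothetical and do not come from any system surveyed in this chapter. The learning element shown is the simplest possible one; it stores the answer supplied by the environment without forming any hypotheses.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Tuple


@dataclass
class KnowledgeBase:
    """Declarative body of information (a circle in the model)."""
    rules: Dict[str, str] = field(default_factory=dict)


def performance_element(kb: KnowledgeBase, situation: str) -> str:
    """Procedure (a box): uses the knowledge base to perform the task."""
    return kb.rules.get(situation, "unknown")


def learning_element(kb: KnowledgeBase, situation: str, correct: str) -> None:
    """Procedure (a box): uses environment information to improve the knowledge base."""
    kb.rules[situation] = correct


def run_trials(environment: Iterable[Tuple[str, str]], kb: KnowledgeBase) -> int:
    """Present each situation, compare with the standard, and feed errors back.

    Returns the number of errors made by the performance element.
    """
    errors = 0
    for situation, correct in environment:
        if performance_element(kb, situation) != correct:
            errors += 1
            # Feedback from the performance attempt reaches the learning element.
            learning_element(kb, situation, correct)
    return errors


# Hypothetical environment: situations paired with the correct response.
environment = [("red light", "stop"), ("green light", "go")]
kb = KnowledgeBase()
first_pass_errors = run_trials(environment, kb)
second_pass_errors = run_trials(environment, kb)
```

In this sketch the first pass makes an error on every situation and the second pass makes none, because each error was fed back to the learning element and recorded in the knowledge base.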

The  Environment 

The most important factor affecting the design of learning systems is the
kind of information supplied to the system by the environment, particularly
the level and quality of this information.

The level of information refers to the degree of generality (or domain
of applicability) of the information relative to the needs of the performance
element. High-level information is abstract information that is relevant to a
broad class of problems. Low-level information is detailed information that is
relevant to a single problem. The task of the learning element can be viewed
as the task of bridging the gap between the level at which the information is
provided by the environment and the level at which the performance element
can use the information to carry out its function. Thus, if the learning system
is given very abstract (high-level) advice about its performance task, it must
fill in the missing details, so that the performance element can interpret
the information in particular situations. Correspondingly, if the system is
given very specific (low-level) information about how to perform in particular
situations, the learning element must generalize this information, by ignoring
unimportant details, into a rule that can be used to guide the performance
element in a broader class of situations.

Since its knowledge is imperfect, the learning element does not know in
advance exactly how to fill in missing details or ignore unimportant details.
Consequently, it must guess (that is, form hypotheses) about how the gap
between the levels should be bridged. After guessing, the system must receive
some feedback that allows it to evaluate its hypotheses and revise them if
necessary. It is in this way that a learning system learns: by trial and error.

The  level  of  the  information  provided  by  the  environment  determines 
the  kinds  of  hypotheses  that  the  system  must  generate.  Four  basic  learning 
situations  can  be  discerned: 

1. Rote learning, in which the environment provides information exactly at
the level of the performance task and, thus, no hypotheses are needed.

2. Learning by being told, in which the information provided by the
environment is too abstract or general and, thus, the learning element must
hypothesize the missing details.

3. Learning from examples, in which the information provided by the
environment is too specific and detailed and, thus, the learning element must
hypothesize more general rules.

4. Learning by analogy, in which the information provided by the
environment is relevant only to an analogous performance task and, thus, the
learning system must discover the analogy and hypothesize analogous
rules for its present performance task.

Each  of  these  learning  situations  is  discussed  in  more  detail  below. 

The quality of information can have a significant effect on the difficulty
of the learning task. Induction is easiest, for example, when the training
instances are selected by a cooperative teacher who chooses "clean" examples,
classifies them, and presents them in good pedagogical order. Learning
by induction is particularly difficult when the training instances are made
up of noise-ridden, unclassified data that are "presented" by nature in an
uncontrollable fashion. Similarly, in advice-taking systems, information is
of little use if it is provided by an unreliable and inarticulate expert; rote
learning cannot succeed with poor-quality, possibly contradictory data; and
analogies are useless if they are cluttered with errors.

The Knowledge Base

The second factor affecting the design of learning systems is the knowledge
base, its form and content. We discuss first the form, or representational
system, in which the knowledge base is expressed; it is a particularly important
design consideration (see Chap. III, in Vol. I, on representation of knowledge).
Most work in learning has used one of two basic representational forms,
feature vectors and predicate calculus, although other forms, such as
production rules, grammars, LISP functions, numerical polynomials, semantic nets,
and frames, have also been used. These representational forms vary along
four important dimensions: expressiveness, ease of inference, modifiability,
and extendability.

Expressiveness of the representation. In any AI system it is important
to have a representation in which the relevant knowledge can be easily
expressed. Feature vectors, for example, are useful for describing objects that
lack internal structure. They describe objects in terms of a fixed set of
features (such as color, shape, and size) that take on a finite set of values (such
as red or green, circle or square, and small or large). Predicate calculus, on
the other hand, is useful for describing structured objects and situations. A
situation in which a red object is on top of a green one, for example, can be
expressed as ∃x, y : RED(x) ∧ GREEN(y) ∧ ONTOP(x, y).
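The contrast between the two representations can be made concrete with a small Python sketch. Everything in it is an illustrative assumption rather than part of any system described here: the dictionary encoding of a feature vector, the encoding of a predicate-calculus description as a set of ground facts, and the helper names objects_in and red_on_green.

```python
from itertools import product

# Feature vector: a fixed set of features, each taking one of a finite set of
# values. It describes a single object and has no place for relations.
block = {"color": "red", "shape": "square", "size": "small"}

# Predicate-calculus-style description: a set of ground facts that can relate
# several objects to one another.
scene = {
    ("RED", "a"),
    ("GREEN", "b"),
    ("ONTOP", "a", "b"),
}


def objects_in(facts):
    """Collect every object name mentioned as an argument of some fact."""
    return sorted({arg for fact in facts for arg in fact[1:]})


def red_on_green(facts):
    """Test: exists x, y such that RED(x) and GREEN(y) and ONTOP(x, y).

    The existential quantifier is handled by trying every pair of objects.
    """
    objs = objects_in(facts)
    return any(
        ("RED", x) in facts and ("GREEN", y) in facts and ("ONTOP", x, y) in facts
        for x, y in product(objs, repeat=2)
    )


reversed_scene = {("RED", "a"), ("GREEN", "b"), ("ONTOP", "b", "a")}
```

Under these assumptions, red_on_green(scene) holds and red_on_green(reversed_scene) does not. A single feature vector like block cannot express the difference, since the difference lies in a relation between two objects.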

Ease of inference within the representation. The computational
cost of performing inference is another important property of a
representational system. One type of inference frequently required in learning systems is
the comparison of two descriptions to determine whether they are equivalent.
It is very easy to test two feature vectors for equivalence. The comparison of
two predicate-calculus expressions is more costly. Since many learning systems
must search large spaces of possible descriptions, the cost of comparisons can
severely limit the extent of these searches.
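The difference in cost can be illustrated with a short Python sketch. It makes a simplifying assumption that is not drawn from the text: a predicate-calculus description is a conjunction of atoms over existentially quantified variables, and two descriptions count as equivalent when a one-to-one renaming of variables turns one into the other. The function names are hypothetical, and the brute-force search over renamings is chosen for clarity, not efficiency.

```python
from itertools import permutations


def vectors_equivalent(a, b):
    """Feature vectors: one comparison per feature."""
    return a == b


def variables_of(description):
    """Collect the variable names used as arguments in a description."""
    return sorted({arg for atom in description for arg in atom[1:]})


def rename(description, mapping):
    """Apply a variable renaming to every atom of a description."""
    return frozenset(
        (atom[0],) + tuple(mapping[arg] for arg in atom[1:]) for atom in description
    )


def descriptions_equivalent(a, b):
    """True if some one-to-one renaming of the variables of a yields b.

    In the worst case this tries n! renamings for n variables, which is why
    such comparisons can dominate the cost of a search over descriptions.
    """
    vars_a, vars_b = variables_of(a), variables_of(b)
    if len(vars_a) != len(vars_b) or len(a) != len(b):
        return False
    target = frozenset(b)
    return any(
        rename(a, dict(zip(vars_a, perm))) == target
        for perm in permutations(vars_b)
    )


d1 = {("RED", "x"), ("GREEN", "y"), ("ONTOP", "x", "y")}
d2 = {("RED", "u"), ("GREEN", "v"), ("ONTOP", "u", "v")}   # d1 with x->u, y->v
d3 = {("RED", "u"), ("GREEN", "v"), ("ONTOP", "v", "u")}   # green on top of red
```

Here d1 and d2 are found equivalent and d1 and d3 are not, but only after candidate renamings have been tried; the feature-vector test needs no search at all.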
Modifiability of the knowledge base. A learning system must, by its
very nature, modify some part of the knowledge base to store the knowledge it
is gaining. Consequently, most learning systems have employed explicit,
stylized representations (such as feature vectors, predicate calculus, and
production rules) in which it is easy to add knowledge to the knowledge base. Very
little attention has been given to the problem of adding to knowledge bases in
which substantial revision and integration must be performed. These problems
arise, for example, in systems that refer to time or state information
(e.g., procedural representations) and in systems that make default
assumptions that may later need to be retracted.

Extendability of the representation. For a learning program to
manipulate explicitly its acquired knowledge, there must be a meta-level
description within the program that tells how the representation is
structured. This meta-level knowledge has usually been embodied in procedures
that manipulate the data structures of the representation. Of recent
interest in learning research, however, are representational systems in which this
meta-knowledge is also made an explicit part of the knowledge base (see Davis,
1976). The purpose is to allow the program to examine and alter its own
representation by adding vocabulary terms and representational structures.
This ability in turn provides the possibility of developing learning systems
that are open-ended, that is, that can learn successively more complex units
of knowledge without limit. The outstanding example of an extendable
representation is Lenat's (1976) AM program (see Article XIV.D4c), which allows
new concepts to be defined in terms of old ones. Recent work on RLL (Greiner
and Lenat, 1980; Greiner, 1980) has pushed this idea much further toward
allowing a program to define new representations dynamically.

Now that we have examined issues relating to the form of the knowledge
base, we turn our attention to its content. A learning system does not gain
knowledge by starting "from scratch," that is, without any knowledge at all.
Some knowledge must be employed by every learning system to understand the
information provided by the environment, to form hypotheses, and to test and
refine those hypotheses. Thus, it is more appropriate to view a learning system
as extending and improving an existing body of knowledge. Unfortunately,
in most learning systems, the knowledge employed is not explicit; it is built
into the program by the designer. Throughout this chapter, we try to point
out the ways in which domain-specific knowledge has entered into existing
learning systems.

The  Performance  Element 

The performance element is the focus of the whole learning system, since
it is the actions of the performance element that the learning element is trying
to improve. There are three important issues related to the performance
element: complexity, feedback, and transparency.

First, the complexity of the task is important. Complex tasks require
more knowledge than simple tasks. For instance, a simple task like binary
classification, in which objects are classified into one of two groups, requires
only a single classification rule. On the other hand, a program that can play a
reasonable poker game (Waterman, 1970) needs about 20 rules, and a
medical-diagnosis system like MYCIN (Shortliffe, 1976) employs several hundred rules.

In learning from examples, three classes of performance tasks can be
distinguished according to their complexity. The simplest performance task
is classification or prediction based on a single concept or rule. Indeed, the
problem of learning single concepts from examples has received more study
than any other problem in AI learning research. Slightly more complex are
tasks involving multiple concepts. An example is the problem of predicting
which bonds of an organic molecule will be broken in the mass spectrometer;
the DENDRAL prediction program employs a set of cleavage rules to perform
this task. The most complex tasks for which learning systems have been
developed are small planning tasks in which a set of rules must be applied in
sequence. Symbolic integration, for example, is a task that requires chaining
together several integration rules to obtain a solution. The articles on learning
from examples consider these three classes of performance tasks and their
corresponding learning methods.

As the performance task becomes more complex and the knowledge base
grows in size, the problems of integrating new rules and diagnosing incorrect
rules become more complicated. The integration problem (that is, the
problem of integrating a new rule into an existing set of rules) is difficult, because
the learning system must consider possible interactions between the new rule
and the previous rules. During the construction of the MYCIN system, for
example, there were several cases in which a new rule caused existing rules to
be applied incorrectly or to cease being applied altogether (see Article VIII.B1).

The problem of diagnosing incorrect rules, also known as the
credit-assignment problem (Minsky, 1963), can be very difficult in systems that
perform a sequence of actions before receiving any feedback. Consider, for
example, the problem of learning to play chess by first playing a complete
game, then determining who won and lost, and finally updating the knowledge
base accordingly. The credit-assignment problem is the problem of assigning
credit or blame to the individual decisions that led to some overall result (in
this case, the individual chess moves that contributed most to the win or loss).

The second important issue related to the performance task is the role of
the performance element in providing feedback to the learning element. All
learning systems must have some way of evaluating the hypotheses that have
been proposed by the learning element. Some programs have a separate body
of knowledge for such evaluation. The AM program, for example, has many
heuristic rules that assess the interestingness of the new concepts developed by
the learning element. A more frequently used technique, however, is to have
the environment, often a teacher, provide an external performance standard.
Then, by observing how well the performance element is doing relative to this
standard, the system can evaluate its current store of hypotheses.

In systems that learn a single concept from training instances, the
performance standard is the correct classification of each training instance (as to
whether it is, or is not, an instance of the concept to be learned). In most
systems, the training instances are preclassified by a reliable teacher. In the
Meta-DENDRAL system (see Article XIV.D4b), the performance standard is
the actual mass spectrum produced when a molecule of known structure is
placed in the mass spectrometer.

i'l.e  third  issue  regarding  the  performance  task  is  (ho  transparency  of  the 
perhirmancr  element.  For  1  he  learning  element  to  assign  credit  or  blame  to 
: :id u.  iCial  rules  m  the  knowledge  base,  it  is  useful  for  the  learning  element 
to  h  i.e  iccess  to  'he  internal  actions  of  the  performauee  element.  Consider 
i.;,iiii  :he  problem  of  learning  how  to  play  chess  ll  the  learning  element 
is  given  a  'race  of  all  the  moves  that  were  considered  by  the  performance 
eV'iient  ir other  than  only  those  moves  that  were  actually  chosen),  the  crcdit- 
.I'sigumeiit  problem  is  easier  to  solve. 

Overview of the Chapter

In the previous section, we discussed the interaction between the
information provided by the environment and the problems that are presented
to the learning element. From this analysis, four learning situations could
be discerned. In this section, we discuss these four situations in detail and
give an example of a learning problem in each situation. The remainder of
this chapter is organized around these four situations, with a separate set of
articles devoted to each.

Rote learning. The simplest learning situation is one in which the
environment supplies knowledge in a form that can be used directly by the
performance element. The learning system does not need to do any processing
to understand or interpret the information supplied by the environment. All
it must do is memorize the incoming information for later use. This is a form
of rote learning, if it is considered learning at all. Virtually every computer
system can be said to do rote learning insofar as it stores instructions for
performing a task.

An important AI study of rote learning was undertaken by Samuel (1959,
1967). He developed a checkers-playing program that was able to improve
its performance by memorizing every board position that it evaluated. The
program used a standard minimax look-ahead search (see Chap. II, in Vol. I)
that evaluated potential future board positions. A simple polynomial
evaluation function measured board properties such as center control, fork threats,
and possible exchanges. In terms of our primitive learning-system model, the
look-ahead search portion of Samuel's program served as the "environment."
It supplied the learning element with board positions and their backed-up
minimax values. The learning element simply stored these board positions
and indexed them for rapid retrieval. Interestingly, the look-ahead search
portion of Samuel's program also served as part of the performance element
that played a game of checkers against an opponent. It used the previously
memorized board positions to improve the speed and depth of its look-ahead
search during subsequent games.
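The idea of storing backed-up minimax values for later reuse can be sketched in Python as follows. This is a toy illustration, not a reconstruction of Samuel's program: the game tree (CHILDREN), the leaf evaluations (STATIC_VALUE), the memory table keyed by position and side to move, and the stats counters are all hypothetical.

```python
# Hypothetical toy game tree: each position lists its successor positions.
CHILDREN = {"start": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}

# Hypothetical static evaluations for the leaf positions.
STATIC_VALUE = {"a1": 3, "a2": 5, "b1": 2, "b2": 9}


def minimax(position, maximizing, memory, stats):
    """Minimax look-ahead that memorizes every backed-up value it computes."""
    key = (position, maximizing)
    if key in memory:
        # Rote recall: reuse the stored value instead of searching again.
        stats["recalled"] += 1
        return memory[key]
    stats["searched"] += 1
    children = CHILDREN.get(position, [])
    if not children:
        value = STATIC_VALUE[position]
    else:
        values = [minimax(c, not maximizing, memory, stats) for c in children]
        value = max(values) if maximizing else min(values)
    memory[key] = value  # the "learning element": store and index the position
    return value


memory = {}
first_game = {"recalled": 0, "searched": 0}
first_value = minimax("start", True, memory, first_game)
second_game = {"recalled": 0, "searched": 0}
second_value = minimax("start", True, memory, second_game)
```

In the first call all seven positions are searched; in the second call the value of the starting position is simply recalled. In a real game the saved effort could be spent searching deeper from positions that have not yet been memorized.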

Learning by being told: Advice-taking. When a system is given
vague, general-purpose knowledge or advice, it must transform this high-level
knowledge into a form that can be used readily by the performance element.
This transformation is called operationalization. The system must understand
and interpret the high-level knowledge and relate it to what it already knows.
Operationalization is an active process that can involve such activities as
deducing the consequences of what it has been told, making assumptions and
"filling in the details," and deciding when to ask for more advice. McCarthy's
(1958) proposal for an "advice taker" was the first description of a system that
could learn by being told. More recent work in the area of learning by being
told includes the TEIRESIAS program (Davis, 1976) and Mostow's program
FOO (Mostow and Hayes-Roth, 1979; Mostow, 1981).

FOO, for example, is told the rules of the game of Hearts and is given vague
strategic advice such as "Avoid taking points." It operationalizes this advice
into specific strategies such as "Play lower than the highest card so far in the
suit led." This kind of operationalization is similar to the kind of processing
performed by ordinary language compilers that convert unexecutable
high-level languages into directly interpretable machine code. In the same trivial
sense that every computer system can be said to do rote learning, every
system can also be said to learn by being told: Advice in the form of a
high-level language program is compiled and assembled into an executable object
program.

Learning  from  examples — Induction.  One  way  to  teach  a  system 
how to perform a task is to present it with examples of how it should behave.
The system must then generalize these examples to find higher level rules that
can  be  applied  to  guide  the  performance  element.  Examples  can  be  viewed  as 
being  pieces  of  very  specific  knowledge  that  cannot  be  used  efficiently  by  the 
performance  element.  These  are  transformed  into  more  general,  higher  level 
pieces  of  knowledge  that  can  be  used  effectively. 

For  example,  consider  the  problem  of  teaching  a  program  to  recognize 
poker  hands  that  contain  a  pair.  The  program  would  be  presented  with  sample 
hands  that,  it  is  told,  contain  pairs.  Here  is  such  a  training  instance: 

4 of clubs, 4 of spades, 5 of diamonds, 8 of hearts, jack of diamonds.

This training example is a very specific piece of knowledge. If the program
merely memorized it (by rote learning), it would now know that the hand

4 of clubs, 4 of spades, 5 of diamonds, 8 of hearts, jack of diamonds

contains a pair. It would not know that the hand

4 of clubs, 4 of spades, 5 of diamonds, 8 of hearts, 9 of diamonds

also contains a pair, since the program has not generalized its knowledge. To
recognize all possible pair hands, the program needs to discover that the hand
must contain two cards of the same rank and that the remaining cards are
irrelevant. The generalization of knowledge to make it apply to a broader
class  of  situations  is  the  key  inference  process  in  learning  from  examples. 
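
The difference between memorizing a training instance and generalizing from it can be sketched in a few lines of Python. The hand encoding (tuples of rank and suit) and the function names are assumptions made for illustration; they are not taken from any of the systems discussed in this chapter.

```python
from collections import Counter

# A hand is a tuple of (rank, suit) cards; this encoding is assumed for illustration.
TRAINING_HAND = (("4", "clubs"), ("4", "spades"), ("5", "diamonds"),
                 ("8", "hearts"), ("jack", "diamonds"))

# Rote learning: remember exactly the hands that were labeled as containing a pair.
memorized_pair_hands = {frozenset(TRAINING_HAND)}


def rote_has_pair(hand):
    """Recognize a pair only if this exact hand was memorized earlier."""
    return frozenset(hand) in memorized_pair_hands


def generalized_has_pair(hand):
    """Generalized rule: some rank occurs at least twice; the other cards are irrelevant."""
    rank_counts = Counter(rank for rank, _suit in hand)
    return any(count >= 2 for count in rank_counts.values())
```

The rote recognizer accepts only the memorized hand, whereas the generalized rule also accepts pair hands that were never presented during training.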

Learning  by  analogy.  If  a  system  has  available  to  it  a  knowledge  base 
for  a  related  performance  task,  it  may  be  able  to  improve  its  own  performance 
by  recognizing  analogies  and  transferring  the  relevant  knowledge  from  the 
other knowledge base. Thus far, however, very little work has been done
in this area. Some of the open research questions are: What exactly is an
analogy? How are analogies recognized? How is the relevant knowledge
transferred from the analogous knowledge base and applied to accomplish
the desired tasks?

Suppose,  for  example,  that  a  program  has  available  to  it  a  knowledge 
base  describing  how  to  diagnose  diseases  in  human  beings  and  someone  wants 
to  use  the  same  program  to  diagnose  computer-system  failures.  By  finding 
the  proper  analogies,  the  program  can  develop  classes  of  computer  failures 
(“diseases”)  and  possible  solutions  (“therapies”).  Diagnostic  procedures  can 
be transferred as the analogy is developed (e.g., x-rays can be analogized to
core  dumps). 

We  do  not  include  in  this  chapter  any  articles  discussing  learning  by 
analogy,  since  this  area  has  not  received  much  attention. 

Conclusion 

This introduction has surveyed AI research on learning and presented a
simple model of AI learning systems. The model has been used to discuss the
factors that bear upon the design of the learning element. These include the
level  and  quality  of  the  information  provided  by  the  environment,  the  form 
and  content  of  the  knowledge  base,  and  the  complexity  and  transparency  of 
the  performance  element.  Of  these  factors,  the  most  important  is  the  level  of 
the  information  provided  by  the  environment.  This  has  been  used  to  develop 
the  simple  taxonomy  of  four  learning  situations  that  provides  an  organization 
for the remainder of this chapter.

ROTE LEARNING is memorization; it is saving new knowledge so that when
it is needed again, the only problem will be retrieval, rather than a repeated
computation, inference, or query. Two extreme perspectives on rote learning
are possible. One view says that memorization is such a basic necessity for any
intelligent  program  that  it  cannot  be  considered  a  separate  learning  process 
at  all.  An  alternate  view  regards  memorization  as  a  complex  subject  that 
is  vital  to  any  effective  cognitive  system  and  well  worth  study  and  modeling 
on  its  own.  This  article  takes  a  less  extreme  perspective,  partly  because  the 
former  viewpoint  leaves  nothing  to  say  about  rote  learning  and  the  latter 
would  require  more  than  is  appropriate  here.  (See  Chap.  XI  for  a  discussion 
of  AI  investigations  into  human  memory  processes.) 

Rote  memorization  can  be  seen  as  an  elementary  learning  process,  not 
powerful  enough  to  accomplish  intelligent  learning  on  its  own  (because  not 
everything that needs to be known in any nontrivial domain can be memorized),
but an inherent and important part of any learning system. All learning
systems must remember the knowledge that they have acquired so that it can
be applied in the future. In a rote-learning system, the knowledge has already
been gained by some method and is in a directly usable form. Other, more
sophisticated learning systems first acquire the knowledge from examples or
from advice and then memorize it. Thus, all learning systems are built on
a  rote-learning  process  that  stores,  maintains,  and  retrieves  knowledge  in  a 
knowledge  base. 

Rote learning works by taking problems that the performance element
has solved and memorizing the problem and its solution. Viewed abstractly,
the performance element can be thought of as some function, $f$, that takes an
input pattern $(X_1, \ldots, X_n)$ and computes an output value
$(Y_1, \ldots, Y_p)$. A rote memory for $f$ simply stores the associated pair
$[(X_1, \ldots, X_n), (Y_1, \ldots, Y_p)]$ in memory. During subsequent
computations of $f(X_1, \ldots, X_n)$, the performance element can simply
retrieve $(Y_1, \ldots, Y_p)$ from memory rather than recomputing it. This
simple model of rote learning is depicted in Figure B1-1.

Figure B1-1. Simple model of rote learning.

Consider, for example, an automobile insurance program that determines
the cost of repairs for damaged automobiles. The input pattern is a description
of the damaged automobile, including make and year, and a list of the
damaged portions of the car. The output value is the estimated cost of the
repairs. The system has only a rote memory. To estimate the cost of repairs,
it looks in its memory for a previous automobile of the same make, model,
and damage description and retrieves the corresponding cost. If it cannot
find  such  an  automobile,  it  uses  a  set  of  rules  (published  by  a  consortium 
of  insurance  companies)  to  guess  the  cost  of  the  repairs  and  then  saves  its 
estimate  for  future  use.  This  computed  estimate,  along  with  the  description 
of  the  damaged  automobile,  forms  the  associated  pair  that  is  memorized. 
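
A minimal Python sketch of such a rote memory follows. The estimating rule is a hypothetical stand-in (the consortium's actual rules are not given here), and the class and field names are assumptions chosen for illustration.

```python
def estimate_by_rules(make, year, damaged_parts):
    """Hypothetical stand-in for the published estimating rules."""
    base_cost_per_part = 150.0
    age_surcharge = max(0, 1980 - year) * 2.0
    return len(damaged_parts) * (base_cost_per_part + age_surcharge)


class RoteEstimator:
    """Stores each (input pattern, output value) pair the first time it is computed."""

    def __init__(self, compute):
        self.compute = compute
        self.memory = {}        # input pattern -> memorized output value
        self.computations = 0   # how often the rules actually had to be applied

    def estimate(self, make, year, damaged_parts):
        # The input pattern must be hashable; the order of damaged parts is ignored.
        pattern = (make, year, frozenset(damaged_parts))
        if pattern in self.memory:
            return self.memory[pattern]          # retrieval instead of computation
        self.computations += 1
        cost = self.compute(make, year, damaged_parts)
        self.memory[pattern] = cost              # memorize the associated pair
        return cost
```

A second request with the same description is answered from memory, so `computations` stays at 1 however often that description recurs.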

Lenat, Hayes-Roth, and Klahr (1979) provide an interesting perspective
on rote learning. They point out that rote learning (or “caching”) can be
viewed as the lowest level of a hierarchy of data reductions. The reductions
are analogous to computer language compilation: The purpose is to refine the
original information down to the essentials for performance. In rote learning,
we generally attempt to save the input/output details of some calculation and
so bypass a future need for the intermediate computation process. Thus, a
calculation task, if valuable and stable enough to be remembered, is reduced
to an access task (see Fig. B1-2, below).

Just as calculations can be reduced to retrievals by caching, so can other
inferential  processes  be  reduced  to  simpler  tasks.  For  instance,  deductions  can 
be  reduced  to  calculations.  The  first  time  we  are  asked  to  solve  a  quadratic 
equation,  for  example,  we  must  follow  lengthy  deductive  chains  to  find  the 
quadratic  formula.  Subsequently,  we  can  simply  compute  the  roots  of  a 
quadratic  equation  directly  from  the  formula.  We  have  distilled  the  results 
of a deductive search and summarized them as an efficient algorithm. Going
one  step  further,  the  process  of  induction  can  convert  a  huge  body  of  training 
instances  into  a  single  heuristic  rule.  Once  again,  the  primary  gain  is  in 
efficiency: It is no longer necessary to consult a huge body of examples to find
out  how  to  behave  in  a  new  situation. 

[Figure: ACCESS, CALCULATE, DEDUCE, and INDUCE arranged in a row and joined
by arrows, with the labels Cache (Rote), Algorithm or Theorem, and Heuristic
Rule beneath them.]

Figure B1-2. Spectrum of data reductions (from Lenat et al., 1979).

Issues in the Design of Rote-learning Systems

There are three important issues relevant to rote-learning systems: memory
organization, stability, and the store-versus-compute trade-off.

Memory organization. Rote learning is useful only if it takes less time
to retrieve the desired item than it does to recompute it. Retrieval can be
made very rapid by properly organizing memory. Consequently, indexing,
sorting, and hashing techniques have been thoroughly studied in the computer
science subfields of data structures (Aho, Hopcroft, and Ullman, 1974) and
database systems (Wiederhold, 1977; Date, 1977; Ullman, 1980).

Stability of the environment and the frame problem. Rote learning
is not very helpful or effective in a rapidly changing environment. One
important assumption underlying rote learning is that information stored at
one time will still be valid later. If, however, the information changes
frequently, this assumption can be violated. Consider, for example, information
gathered about automobile repair costs during the early 1950s. Such information
would be of little value for estimating automobile repair costs in the 1980s
because the world has changed in critical ways: The makes and models of
cars presently manufactured did not exist in the 1950s; furthermore, inflation
has made the direct comparison of dollar costs impossible. A rote-learning
system must be able to detect when the world has changed in such a way as
to make stored information invalid. This is an instance of the frame problem
(see Chap. III, in Vol. I).

Some  solutions  to  this  problem  have  been  developed.  One  approach  is  to 
monitor  every  change  to  the  world  and  keep  the  stored  information  always 
up  to  date.  Thus,  when  an  old  model  of  automobile  is  discontinued,  all 
information  about  that  model  could  be  removed  from  the  knowledge  base. 
This  approach  requires  that  the  relevant  aspects  of  the  world  be  continually 
monitored. 

A  second  approach  to  solving  the  frame  problem  is  to  check,  when  the 
information is retrieved for use, that it is still valid. Typically, this requires
storing, along with the information itself, some additional data about the
state of the world at the time the information was memorized. When the
information is retrieved, the stored state can be compared to the current
state, and the system can determine whether or not the information is still
valid.  This  approach  requires  that  the  relevant  aspects  of  the  world  (such  as 
the  current  value  of  the  dollar)  be  anticipated  and  stored  with  the  data. 

Many  other  approaches  are  possible.  If  the  system  can  determine  how 
the world has changed (e.g., by knowing the inflation rate), it may be able
to  make  appropriate  modifications  to  restore  the  validity  of  the  memorized 
information  (e.g.,  by  converting  the  1950  prices  into  1980  equivalents). 
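
The last two approaches can be combined in a short Python sketch: each memorized cost carries the relevant piece of world state (here, an assumed price index), validity is checked at retrieval time, and a stale cost is rescaled rather than discarded. The index values, the tolerance `max_drift`, and all names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class StoredEstimate:
    cost: float          # cost in the dollars of the time it was memorized
    price_index: float   # assumed price index in force when it was memorized


def retrieve_cost(entry, current_price_index, max_drift=0.05):
    """Return (cost, was_adjusted) for a memorized estimate.

    The stored state (price index) is compared with the current state. If the
    relative change exceeds max_drift, the memorized cost is treated as stale
    and converted into current dollars.
    """
    ratio = current_price_index / entry.price_index
    if abs(ratio - 1.0) <= max_drift:
        return entry.cost, False        # still valid, use as memorized
    return entry.cost * ratio, True     # restore validity by rescaling
```

Only the aspect of the world that was anticipated and stored (the price index) can be checked this way; a change such as a discontinued model would need a different test.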

Store-versus-compute trade-off. Since the primary goal of rote learning
is to improve the overall performance of the system, it is important that
the rote-learning process itself does not decrease the efficiency of the system.
It is conceivable, for instance, that the cost of storing and retrieving the
memorized information is greater than the cost of recomputing it. This is
certainly the case with the multiplication of two numbers; virtually all
computers recompute the product of two numbers rather than store a large
multiplication table.

There are two basic approaches to resolving the store-versus-compute
trade-off. One is to decide at the time the information is first available
whether or not it should be stored for later use. A cost-benefit analysis
can be performed that weighs the amount of storage space consumed by
the information and the cost of recomputing it against the likelihood that
the information will be needed in the future. A second approach is to go
ahead and store the information and later decide whether or not to forget
it. This procedure, called selective forgetting, allows the system to determine
empirically which items of information are most frequently reused.

One of the most common selective-forgetting techniques is called the least
recently used (LRU) replacement algorithm. Each item stored in memory
is tagged with the time when it was last retrieved. Every time an item
is retrieved, its “time of last use” is updated. When a new item is to be
memorized, the least recently used item is forgotten and replaced by the new
one.  Variations  on  this  scheme  take  into  consideration  the  amount  of  storage 
required  for  the  item,  the  cost  of  recomputing  the  item,  and  so  on. 
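
The basic LRU scheme can be sketched in Python as follows. The fixed `capacity`, the logical clock, and the decision to count storage itself as a "use" are assumptions of this sketch, not details given in the text.

```python
import itertools


class LRUMemory:
    """Rote memory with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = {}                  # key -> memorized value
        self.last_used = {}              # key -> logical time of last use
        self.clock = itertools.count()   # monotonically increasing time stamps

    def retrieve(self, key):
        if key not in self.items:
            return None
        self.last_used[key] = next(self.clock)   # update "time of last use"
        return self.items[key]

    def memorize(self, key, value):
        if key not in self.items and len(self.items) >= self.capacity:
            # Forget the item whose last use lies furthest in the past.
            victim = min(self.last_used, key=self.last_used.get)
            del self.items[victim]
            del self.last_used[victim]
        self.items[key] = value
        self.last_used[key] = next(self.clock)
```

With a capacity of two, memorizing a third item forgets whichever of the first two was used less recently.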

References 

Lenat, Hayes-Roth, and Klahr (1979) provide an excellent discussion of
various  learning  methods,  including  rote  learning.  Samuel  (1959)  remains  the 
best  example  of  research  into  rote  processes. 


B2.  Rote  Learning  in  Samuel’s  Checkers  Player 


SAMUEL conducted a series of studies (1959, 1967) on how to get a computer
to learn to play checkers. Among the earliest investigations of machine
learning, they remain some of the most successful both in terms of improved
performance (i.e., demonstrated improvements in the performance element)
and in terms of lessons for AI. His experiments with three different learning
methods (rote learning, polynomial evaluation functions, and signature
tables) showed that significant improvement in playing checkers could be
obtained. This article focuses on his thorough analysis of the question of how
much rote learning alone can contribute to expertise and improved performance.
Other aspects of Samuel’s work are discussed later, in Article XIV.D5a.


The Game of Checkers as a Performance Task

Checkers is a difficult game to play well. It is estimated that a full exploration
of all possible moves in checkers would require examining roughly 10^40 moves.
Samuel’s program was provided with procedures for playing the game correctly;
that is, the rules of checkers were incorporated into the program. He
sought to have the program learn to play well by having it memorize and
recall board positions that it had encountered in previous games.

At  each  turn,  Samuel's  program  chose  its  move  by  conducting  a  minimax 
game-tree search (see Articles II.B3 and II.C5, in Vol. I). In principle, of course,
a  program  could  try  all  possible  moves  and  all  possible  consequences  of  each 
move  and  thereby  search  the  entire  checkers  game-tree.  Such  a  calculation — 
which  is  equivalent  to  playing  every  possible  game  of  checkers — is  not  feasible 
because  the  search  space  is  too  large.  Every  potential  move  by  one  player 
generally  leads  to  many  possible  countermoves,  each  of  which  has  still  more 
possible responses. The resulting combinatorial explosion (see Article II.A, in
Vol. I) prevents any program from searching the whole tree.

Consequently,  the  standard  approach  to  conducting  a  game-tree  search  is 
to  search  only  a  few  moves  (and  countermoves)  into  the  future  and  then  apply 
a  static  evaluation  function  to  estimate  which  side  is  winning.  The  program 
then  chooses  the  move  that  leads  to  the  best  estimated  position. 

Suppose,  for  example,  that  at  some  board  position,  A,  it  is  the  program's 
turn to move (see Fig. B2-1). The program searches ahead three moves
by considering first all possible moves that it could make, then all possible
countermoves available to its opponent, and finally all possible replies to those
countermoves. At this point, the program applies a static evaluation function
to estimate its net advantage at each of the board positions shown on the
right in the figure. These values are then “backed up” by assuming that
the opponent will always take the move that is worst for the computer (and
vice versa). Thus, the best move for the program is the one that leads to
position D: the program expects a countermove from the opponent to which
it can reply with D. The static evaluation function has
estimated the value of D to be 8, so this is the backed-up value of position A.

Figure B2-1. An example of a minimax game-tree search.
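
The backing-up procedure just described is a depth-limited minimax search, which can be sketched in Python as below. The toy tree and its leaf values are hypothetical and do not reproduce the tree in the figure; `successors` and `static_value` are assumed interfaces supplied by the caller.

```python
def minimax(position, depth, maximizing, successors, static_value):
    """Backed-up value of `position` from a depth-limited minimax search."""
    children = successors(position)
    if depth == 0 or not children:
        return static_value(position)      # search horizon: estimate statically
    values = [minimax(child, depth - 1, not maximizing, successors, static_value)
              for child in children]
    # The program picks its best move; the opponent is assumed to pick the
    # move that is worst for the program.
    return max(values) if maximizing else min(values)


def best_move(position, depth, successors, static_value):
    """Successor of `position` with the highest backed-up value (program to move)."""
    return max(successors(position),
               key=lambda child: minimax(child, depth - 1, False,
                                         successors, static_value))


# Hypothetical three-level tree rooted at A, with static values at the horizon.
TOY_TREE = {"A": ["B1", "B2"], "B1": ["C1", "C2"], "B2": ["C3"],
            "C1": ["D1", "D2"], "C2": ["D3"], "C3": ["D4", "D5"]}
TOY_VALUES = {"D1": 8, "D2": 3, "D3": 9, "D4": 2, "D5": 5}


def toy_successors(position):
    return TOY_TREE.get(position, [])


def toy_static_value(position):
    return TOY_VALUES.get(position, 0)
```

In this toy tree, a three-move search from A backs up the value 8 and selects the move to B1.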


Improving the Performance of the Checkers Player

There are two basic ways to improve the performance of a game-tree
search. One method is to search farther into the future and thus better
approximate a full search of the tree. This is known as improving the look-ahead
power of the program. The other method is to improve the static
evaluation function, so that the estimated value of each board position is more
accurate. Samuel’s rote-learning studies aimed at improving the look-ahead
power by memorizing the backed-up values of board positions. The techniques
discussed in Article XIV.D5a address the problem of improving the evaluation
function.

The rote-learning approach employed by Samuel saved every board position
encountered during play, along with its backed-up value. In the situation
shown in Figure B2-1, for instance, Samuel’s program would memorize the
description of board position A and its backed-up value of 8 as an associated
pair, [A, 8]. When position A is encountered in subsequent games, its evaluation
score is retrieved from memory rather than recomputed. This makes the
program more efficient, because it does not have to compute the value for A
with the static evaluation function.

There  is  a  more  important  benefit  of  retrieving  the  backed-up  value  of 
A  from  memory,  however.  The  memorized  value  of  A  is  more  accurate  than 
the static value of A, because it is based on a look-ahead search. Thus,
the look-ahead power of the program is improved. Figure B2-2 shows an
example of this improvement. The program is considering which move to
make at position E. It searches ahead three moves and then applies the static
evaluation function. For position A, however, the program is able to retrieve
the memorized value based on the previous search to position D.

This approach improves the effective search depth for E. As more and
more positions are memorized, the effective search depth improves from its
original value of 3 moves, up to 6, then to 9, and so on. Rote learning is thus
used in Samuel’s program to save the results of previous partial game-tree
searches, so that they can gradually be extended and deepened. Rote learning
converts a computation (tree search) into a retrieval from memory.

Figure B2-2. Improving look-ahead power by rote learning.
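
How memorized backed-up values deepen the effective search can be illustrated with a small Python sketch. It is not Samuel's implementation: the memory is a plain dictionary keyed by position and side to move, and the example uses a look-ahead of two moves (rather than three) so that positions at the search horizon have the same side to move as the memorized positions. All names are illustrative assumptions.

```python
def search_with_rote(position, depth, maximizing, successors, static_value, memory):
    """Depth-limited minimax that prefers memorized values at the search horizon."""
    children = successors(position)
    if depth == 0 or not children:
        # A memorized value already summarizes an earlier look-ahead from here,
        # so using it extends the effective depth of the current search.
        return memory.get((position, maximizing), static_value(position))
    values = [search_with_rote(child, depth - 1, not maximizing,
                               successors, static_value, memory)
              for child in children]
    return max(values) if maximizing else min(values)


def search_and_memorize(position, depth, successors, static_value, memory):
    """Search from `position` (program to move) and memorize its backed-up value."""
    value = search_with_rote(position, depth, True, successors, static_value, memory)
    memory[(position, True)] = value
    return value
```

Once a position has been searched and memorized, a later search that reaches it at the horizon sees the backed-up value rather than the shallower static estimate.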

Memory Organization

Samuel employed several clever techniques to store the evaluated board
positions, so that they took up little space and could be retrieved rapidly. To
store the positions compactly, Samuel took advantage of several symmetries
(e.g., positions in which it was Red’s turn to move were converted into the
corresponding Black-to-move positions; king positions are symmetric in two
ways). Efficient retrieval was accomplished by indexing the boards according
to many different characteristics (including the number of pieces on the board,
presence or absence of kings, and piece advantage) and writing them onto
a tape in the order they would most likely be needed during a game. The
use of magnetic tape was necessary because the program was running on a
relatively small IBM 704 computer, and only a few board positions could be
kept in the computer’s core memory. During rote learning, the program would
accumulate  a  number  of  board  positions  before  reading,  sorting,  and  rewriting 
them  onto  the  memory  tape. 

Samuel resolved the store-versus-compute trade-off with a variation of
least recently used (LRU) replacement. Each board position was given an age.
Whenever a position was retrieved from memory, its age was divided by 2.
When the memory tape was rewritten, the ages of all stored positions were
increased by 1, and very old positions were forgotten (that is, not written
back onto tape).
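
The aging scheme can be sketched in Python as follows. The initial age of 0 and the explicit `max_age` threshold are assumptions of this sketch; the text says only that "very old" positions were forgotten.

```python
class AgedMemory:
    """Rote memory with age-based forgetting, in the spirit of Samuel's scheme."""

    def __init__(self, max_age):
        self.max_age = max_age   # assumed cutoff for "very old" positions
        self.values = {}         # board position -> backed-up value
        self.ages = {}           # board position -> current age

    def memorize(self, position, value):
        self.values[position] = value
        self.ages[position] = 0  # assumption: new positions start at age 0

    def retrieve(self, position):
        if position not in self.values:
            return None
        self.ages[position] /= 2  # each retrieval halves the age
        return self.values[position]

    def rewrite_tape(self):
        """Age every stored position by 1 and forget those that are too old."""
        for position in list(self.values):
            self.ages[position] += 1
            if self.ages[position] > self.max_age:
                del self.values[position]
                del self.ages[position]
```

Positions that are retrieved often stay young and survive rewriting, whereas positions that are never used age steadily until they are dropped.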

Results

The  program  was  trained  in  several  ways:  by  playing  against  itself,  by 
playing  against  people  (including  some  checkers  masters),  and  by  following 
published games between master players (so-called book games). After training,
the memory tape contained roughly 53,000 positions. As the program
learned more, it improved slowly but steadily, becoming, in Samuel’s words, a
“rather better-than-average novice, but definitely not ... an expert” (Samuel,
1959, p. 218). Success in learning varied markedly depending on the phase of
the game. The program became capable of playing a very good opening game,
since the number of board variations is relatively small near the start of the
game. Performance during the midgame, with its far greater range of possible
configurations,  did  not  greatly  improve  with  rote  learning.  During  the  end 
game,  the  program  became  able  to  recognize  winning  and  losing  positions  well 
in  advance,  but  it  needed  some  improvement  before  it  was  able  to  force  the 
game to a successful conclusion (see below).

On the whole, Samuel’s experiments demonstrated that significant and
measurable learning can result from rote processes alone, but that on its own,
rote learning is limited in several ways. The first and most obvious limitation
is  in  storage  space  and  retrieval.  One  question  that  interested  Samuel  is  the 
following:  If  rote  learning  produces  steady  improvement  of  performance  as 
it  gathers  new  positions  (up  to  a  limit  determined  by  available  space  and 
the  efficiency  of  indexing  algorithms),  could  it  ever  reach  a  performance  level 
considered  expert  before  exceeding  the  storage  and  indexing  limits?  If  so,  how 
much data would it need to remember, and how long would it take to gather
the data?

Samuel estimated that his program would need to memorize about one
million positions to approximate a master level of checkers play. Unfortunately,
even a system with sufficient storage capacity and rapid retrieval methods
would require an impractical amount of machine playing in order to gather a
million useful positions. However, Samuel suggests that even this long acquisition
period would be shorter than the time taken by humans to improve
from  complete  beginners  to  masters. 

The  inability  of  the  program  actually  to  effect  a  win  once  it  had  a  winning 
position  was  a  curious  problem.  It  was  caused  by  the  mesa  effect  (Minsky, 
1963) — that  is,  once  the  program  has  found  a  winning  position,  all  moves 
look  equally  good,  and  the  program  tends  to  wander  aimlessly.  Samuel  solved 
the  problem  by  storing,  along  with  each  board  position  and  value,  the  length 
of  the  search  path  that  was  used  to  compute  the  board  value.  The  move- 
selection  procedure  was  modified  to  select  the  best  move  that  also  had  the 
shortest  associated  search  distance.  This  change  gave  the  program  a  sense  of 
direction,  so  that  it  was  able  to  press  forward  to  win  the  game  (or  stall  as 
much  as  possible  to  avoid  losing  a  game). 
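
The modified move-selection rule can be sketched in Python as a tie-break on search-path length. The tuple layout and the move names are hypothetical, and the sketch covers only the winning case; the stalling behavior in losing positions is not modeled.

```python
def choose_move(candidates):
    """Pick the best-valued move, breaking ties by the shortest search path.

    `candidates` is a list of (move, backed_up_value, search_path_length)
    tuples; this layout is assumed for illustration.
    """
    best_value = max(value for _move, value, _length in candidates)
    best = [c for c in candidates if c[1] == best_value]
    # Among equally good moves, prefer the one closest to the goal.
    return min(best, key=lambda c: c[2])[0]
```

Without the tie-break, the first two moves in a list such as `[("wander", 100, 7), ("advance", 100, 3), ("blunder", -5, 1)]` look equally good; with it, the move with the shorter path is chosen.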

Another  interesting  problem  arose  when  Samuel  attempted  to  combine 
rote learning with learning techniques that modified the static evaluation function.
Unfortunately, changes to the evaluation function tended to invalidate
previously memorized positions (see Article XIV.B1, on the frame problem).
Samuel’s  solution  was  to  avoid  this  problem  by  postponing  rote  learning  until 
the  evaluation  function  had  been  effectively  learned. 

Conclusion 

Besides  showing  that  real  improvement  of  performance  could  be  gained 
by  the  conceptually  simplest  form  of  learning — rote  memorization — Samuel 
identified and elaborated several issues that need to be handled if rote is
to  offer  significant  gains.  In  general,  the  value  of  rote  learning  is  to  gain 
problem-solving  power  in  the  form  of  speed.  By  retrieving  the  stored  results 
of  extensive  computations,  the  program  can  proceed  deeper  in  its  reasoning. 
The  price  is  storage  space,  access  time,  and  effort  in  organizing  the  stored 
knowledge.

Samuel found that for rote learning to be effective, knowledge had to
be carefully organized for efficient retrieval, stabilized to avoid using values
whose meanings had changed, augmented with search-depth information, and
selectively forgotten so that only the most useful information would tend to
be saved. In the case of Samuel’s checkers player, rote learning may have had
enough  power  on  its  own  to  lead  eventually  to  expert  performance,  but  the 
time  and  space  required  for  that  much  improvement  were  beyond  the  available 
resources. 

References

Samuel  (1959)  describes  the  rote-learning  research  in  detail. 


C. LEARNING BY TAKING ADVICE

In ONE of the earliest AI papers on learning, McCarthy (1958) proposed the
creation of an advice-taking system that could accept advice and make use
of it to plan and execute actions in the world. Until the late 1970s, however,
there were very few attempts to write programs that could learn by taking
advice. The recent emphasis in AI on expert systems has focused new attention
on the problem of converting expert advice into expert performance (see Barr,
Bennett, and Clancey, 1979).

Research  on  advice-taking  systems  has  followed  two  major  paths.  One 
approach  has  been  to  develop  systems  that  accept  abstract,  high-level  advice 
and  convert  it  into  rules  that  can  effectively  guide  the  performance  element. 
This  research  seeks  to  automate  all  phases  of  the  advice-taking  process.  The 
other approach has been to develop sophisticated tools (such as knowledge-base
editing and debugging aids) that make it easier for the expert to transform
his own abstract expertise into detailed rules. In this second approach,
the  expert  is  an  integral  part  of  the  learning  system,  detecting  and  diagnosing 
bugs  and  repairing  and  refining  the  knowledge  base.  The  former  approach 
shows  promise  of  eventually  developing  completely  instructable  systems,  while 
the  latter  approach  has  proved  invaluable  for  creating  knowledge-based  expert 
systems.  This  article  describes  both  of  these  research  paths.  We  will  discuss 
the  more  highly  automated  approach  first  and  return  later  to  the  research  on 
knowledge-base  editing  and  debugging  aids. 

Steps for Automatic Advice-taking

Hayes-Roth,  Klahr,  and  Mostow  (1980,  1981)  provide  an  outline  of  the 
processes  required  to  convert  expert  advice  into  program  performance.  This 
outline can be summarized as follows:

1. Request: request advice from expert,

2. Interpret: assimilate into internal representation,

3. Operationalize: convert into usable form,

4. Integrate: integrate into knowledge base,

5. Evaluate: evaluate resulting actions of performance element.

Request. The first step is for the program to request advice from the
expert. The request can be simple (just asking the expert to give some
general advice) or it can be sophisticated (identifying a shortcoming in the
knowledge base and asking the expert how to repair it). Some systems are
completely passive and simply wait for the expert to interrupt them and
provide advice, while others are very careful to focus the attention of the
expert on a particular problem.

Interpret.  The  next  step  in  advice-taking  is  to  accept  the  advice  and 
represent it internally. McCarthy (1958) points out that in order for a program
to accept advice, the program must have an epistemologically adequate
representation for the advice (see Chap. III, in Vol. I), that is, a representation
that is capable of expressing the advice without losing any information. This
interpretation step can be very difficult if the advice is given in a natural
language. The program must understand the natural language sufficiently well
to convert it into an unambiguous internal representation. See Chapter IV,
in Volume I, for a detailed survey of AI research into natural-language
understanding.

Operationalize.  Once  the  advice  has  been  accepted  and  interpreted  into 
an  unambiguous  representation,  it  still  may  not  be  directly  executable  by  the 
performance element. The third step, operationalization, seeks to bridge the
gap between the level at which the advice is provided and the level at which
the  performance  element  can  apply  it. 

Mostow’s (1981) program FOO, for example, accepts advice about how to
play the card game of Hearts. English-language advice, such as “Avoid taking
points,” is interpreted by FOO’s human user and given to the program as
the lambda-calculus statement (AVOID (TAKE-POINTS ME) (CURRENT TRICK)).
However, even though this advice has been interpreted into an unambiguous
internal representation, it is still not operational since FOO has no procedures
or methods to avoid taking points. FOO does have methods for selecting and
playing cards, however. Thus, the advice must be converted into a form, such
as [ACHIEVE (LOW (CARD-OF ME))] (i.e., “Play a low card”), that requires only
these  operations. 

FOO  accomplishes  this  task  by  applying  many  different  operationalization 
methods (see Article XIV.C2). It tries to re-express the advice, using known
relationships,  until  it  can  recognize  that  one  of  its  operationalization  methods 
is  applicable.  These  methods  then  allow  it  to  develop  a  procedure  for  carrying 
out  all  or  part  of  the  advice.  The  steps  of  reformulating  the  advice  and  apply¬ 
ing  operationalization  methods  are  repeated  until  the  advice  is  completely 
executable. 

This  process  is  similar  to  the  approach  taken  by  automatic-programming 
systems that convert high-level program specifications into efficient
implementations (see Chap. X, in Vol. I). However, unlike those systems, which
seek to create provably correct programs, FOO is not foolproof. The gap between
the advice and the performance element is usually too wide, and the
operationalization methods are usually too weak, to permit error-free
operationalization.
For example, it is often necessary for FOO to make assumptions and
approximations in order to transform the advice. FOO cannot always successfully
“avoid taking points” in Hearts, since it is impossible for the program to know
the contents of its opponents’ hands. Instead, FOO applies heuristic methods
to  reduce  the  likelihood  that  points  will  be  taken.  Its  strategy  of  playing  low 
cards  is,  consequently,  a  tentative  hypothesis  about  how  to  avoid  taking  points. 
The  tentative  hypotheses  developed  by  operationalization  must  be  tested  and 
debugged  before  they  can  be  accepted. 

Integrate.  When  knowledge  is  added  to  the  knowledge  base,  care  must 
be taken to see that it is properly integrated (see Article XIV.A). New advice
can result in new mistakes if it takes precedence over previous knowledge in
situations in which the old knowledge is still correct. Yet the new advice must
take precedence in the intended situations. The learning program must know
enough about how the performance element applies the knowledge to be able
to anticipate and avoid any bad side effects that could result from adding the
knowledge  to  the  knowledge  base. 

Two  common  problems  of  integration  are  (a)  overlapping  applicability 
and  (b)  contradictory  recommendations.  Consider  an  expert  system,  such  as 
MYCIN,  whose  knowledge  base  is  represented  as  a  set  of  production  rules. 
When  a  new  rule  is  added,  its  left-hand  side  (or  condition  part)  may  be  overly 
general,  causing  it  to  trigger  in  situations  in  which  some  other  rule  is  properly 
applicable.  One  solution  to  this  problem  is  to  specialize  the  rules,  so  that  this 
overlap of applicability no longer occurs. Another approach, the meta-rule
approach, is to add ordering rules (meta-rules) that explicitly indicate which
regular  rules  should  be  applied  before  others. 

When the right-hand sides (or action parts) of two production rules
recommend inconsistent actions in the same situation, the problem of
contradictory recommendations arises. Again, either the right-hand sides can be
modified to remove the contradiction or a meta-rule can be added to indicate which
action  should  take  precedence.  There  are  many  other  integration  problems 
aside  from  these  two  typical  ones. 
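
The two integration problems described above can be illustrated with a small sketch. The rule format (a set of condition facts paired with a recommended action), the sample rules, the table of inconsistent actions, and the helper names are all hypothetical; this is not MYCIN's actual representation, only one way the checks might look.

```python
from itertools import combinations

# Hypothetical rule format: name -> (set of condition facts, recommended action).
rules = {
    "r1": ({"A"}, "do-x"),          # overly general left-hand side
    "r2": ({"A", "B"}, "avoid-x"),  # more specific rule for the same cases
    "r3": ({"C"}, "do-y"),
}

# Assumed table of mutually inconsistent actions.
inconsistent = {frozenset({"do-x", "avoid-x"})}

def fires(rule, situation):
    """A rule fires when all of its conditions hold in the situation."""
    conditions, _ = rule
    return conditions <= situation

def conflicts(rules, situation):
    """Pairs of rules that both fire and recommend inconsistent actions."""
    fired = sorted(name for name, rule in rules.items() if fires(rule, situation))
    return [
        (a, b)
        for a, b in combinations(fired, 2)
        if frozenset({rules[a][1], rules[b][1]}) in inconsistent
    ]

def resolve(pair, meta_order):
    """Meta-rule approach: an explicit ordering says which rule goes first."""
    return min(pair, key=meta_order.index)

found = conflicts(rules, {"A", "B"})
winner = resolve(found[0], meta_order=["r2", "r1", "r3"])
```

Here the ordering list plays the role of a meta-rule. The alternative mentioned in the text, specializing r1 so that it no longer fires when B holds, would remove the overlap without any ordering.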

Evaluate.  Since  the  new  knowledge  received  from  the  expert  is  only 
tentative — that  is,  it  is  the  result  of  interpretation,  operationalization,  and 
integration — it  must  be  evaluated  somehow.  The  learning  system  may  be  able 
to  recognize  some  errors  and  inconsistencies  in  the  advice  when  it  integrates 
the advice into the knowledge base. More frequently, however, it is necessary
to  test  the  advice  empirically  by  actually  employing  it  to  perform  some  task 
and  then  assessing  whether  the  system  is  working  properly. 

Evaluation  requires  some  performance  standard  against  which  the  actual 
behavior  of  the  system  can  be  compared.  In  some  domains,  the  performance 
standard  can  be  built  into  the  program.  Game-playing  programs,  for  example, 
can  tell  if  the  system  is  doing  well  by  whether  or  not  the  system  wins  the  game. 
In  other  domains,  however,  the  system  needs  to  set  up  detailed  expectations 
about how the new knowledge will affect the performance of the system. These
expectations allow the program to detect and locate bugs in the knowledge
base. 

Evaluation  can  naturally  feed  back  into  the  request  step  (the  first  of 
these  five  steps).  When  the  program  detects  that  the  performance  element 
is  not  functioning  properly,  it  can  announce  this  to  the  expert  and  request 
additional  advice.  A  more  sophisticated  approach  is  for  the  program  to  do 
credit  assignment — that  is,  to  determine  which  parts  of  the  knowledge  base 
are  incorrect.  Once  the  bug  has  been  located,  the  advice-taking  system  can 
ask  the  expert  to  tell  it  how  to  repair  the  particular  piece  of  knowledge  that 
is  incorrect. 

Now  that  we  have  discussed  the  five  basic  steps  in  an  advice-taking  sys¬ 
tem,  we  describe  some  systems  that  have  been  developed  as  aids  for  creating, 
modifying,  and  debugging  large  knowledge  bases. 

Aids for Knowledge-base Maintenance

Instead of fully automating these five steps, many researchers working
on expert systems have built tools for assisting in the development and
maintenance of expert knowledge bases. EMYCIN (van Melle, 1980; Davis, 1976),
AGE  (Nii  and  Aiello,  1979),  and  KAS  (Reboh,  1981),  for  example,  all  provide 
certain  functions  to  assist  a  domain  expert  or  knowledge  engineer  in  carrying 
out  these  five  steps.  Particular  assistance  has  been  provided  for  integrating 
new  knowledge  into  the  knowledge  base  (intelligent  editors,  flexible  repre¬ 
sentation  languages)  and  for  evaluating  and  debugging  the  knowledge  base 
(explanation and tracing facilities). This semiautomated approach to
advice-taking places the knowledge engineer in the role of requesting, interpreting,
and  operationalizing  the  expert’s  advice. 

To assist the knowledge engineer, these systems must be able to
communicate effectively. It is particularly important for the engineer to get good
feedback from the system during testing and debugging. Thus, a great deal
of effort has been expended on the development of tracing and explanation
facilities for expert systems (see Article VII.B, in Vol. II; Davis, 1976).

Conclusion 

Research on advice-taking systems is still in its infancy, although important
ideas and methods are available from the related areas of natural-language
understanding  and  automatic  programming.  Present  research  is  advancing 
along  two  paths:  the  theoretical  path  of  automatic  operationalization  of  expert 
advice  and  the  practical  path  of  providing  aids  to  help  knowledge  engineers 
build  and  debug  expert  systems.  The  development  of  fully  automatic  systems 
remains  an  active  research  area. 

A few AI systems have been developed that perform some kind of
advice-taking. Mostow’s FOO system is described in Article XIV.C2. The reader
is also directed to the articles on TEIRESIAS (Article VII.B, in Vol. II) and on
Waterman’s poker player (Article XIV.D5b) for other examples of advice-taking
systems.

References

Davis’s work (1976, 1978) describes pioneering efforts in interactive
advice-taking. Hayes-Roth, Klahr, and Mostow (1981) and Mostow and Hayes-Roth
(1979)  present  the  most  comprehensive  analyses  of  advice-taking  as  a  whole. 


C2.  Mostow’s  Operationalizer 


A GROUP of researchers at the Rand Corporation, Carnegie-Mellon University,
and Stanford University has recently been developing the machine-aided
heuristic programming methodology in which a computer would be instructed
to perform a new task in much the same way that a person is taught (see
Hayes-Roth, Klahr, Burge, and Mostow, 1978; Hayes-Roth, Klahr, and Mostow,
1981). A central effort in this project is understanding the problem of
operationalization (see Article XIV.C1). Mostow’s program FOO (First Operational
Operationalizer) is one of the first results of this work. It investigates
principles, problems, and methods involved in converting high-level advice into
effective,  executable  procedures. 

Accepting  Advice  About  the  Game  of  Hearts 

Mostow, in his research with FOO, has dealt primarily with operationalization
problems taken from the card game of Hearts. The game is played as a
sequence  of  tricks.  In  each  trick,  one  player — who  is  said  to  have  the  lead — 
starts  the  trick  by  playing  a  card  and  each  of  the  other  players  continues  the 
trick  by  playing  a  card  during  his  (or  her)  turn.  If  he  can,  each  player  must 
follow  suit,  that  is,  play  a  card  of  the  same  suit  as  the  suit  led.  The  player 
who  played  the  highest  valued  card  in  the  suit  led  takes  the  trick  and  any 
point  cards  contained  in  it.  Every  heart  counts  as  one  point,  and  the  queen 
of  spades  is  worth  13  points.  The  goal  of  the  game  is  to  avoid  taking  points. 
Hayes-Roth et al. (1978) provide a more complete explanation of the game.
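
The trick rules described above can be written down as a short sketch. The card encoding (a rank and suit pair, with ranks 2 to 14 and the ace counted high) and the function names are assumptions made for illustration; the text itself states only who takes the trick and what each point card is worth.

```python
# A card is a (rank, suit) pair. Ranks 2..14 with 14 as the ace is an
# assumed encoding; the text does not spell out the rank order.
QUEEN_OF_SPADES = (12, "spades")

def card_points(card):
    """Every heart counts one point; the queen of spades counts 13."""
    if card[1] == "hearts":
        return 1
    return 13 if card == QUEEN_OF_SPADES else 0

def trick_winner(plays):
    """plays lists (player, card) in order of play; the first card sets the suit led."""
    suit_led = plays[0][1][1]
    in_suit = [(card[0], player) for player, card in plays if card[1] == suit_led]
    return max(in_suit)[1]  # highest card in the suit led takes the trick

def trick_points(plays):
    """Total points carried by the cards in the trick."""
    return sum(card_points(card) for _, card in plays)

plays = [("p1", (10, "clubs")), ("p2", (13, "clubs")),
         ("me", (3, "hearts")), ("p4", (12, "spades"))]
winner = trick_winner(plays)
points = trick_points(plays)
```

In this sample trick the two players who could not follow suit discard point cards, and the player holding the highest club takes all of them.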

Hearts is a game of partial information, with no known algorithm for winning.
Although the possible situations in the game are extremely numerous,
beginning players often hear general advice such as “Avoid taking points,”
“Don’t lead a high card in a suit in which an opponent is void,” and “If an
opponent has the queen of spades, try to flush it.” The task of the FOO
program  is  to  take  such  general  advice  and  render  it  directly  applicable  by  a 
performance  program.  This  task  can  be  viewed  as  a  kind  of  planning  task. 
A  piece  of  advice,  such  as  “Avoid  taking  points,”  can  be  viewed  as  a  goal. 
The  operationalization  program  must  develop  an  executable  plan  for  achiev¬ 
ing  that  goal.  What  makes  this  advice  difficult  to  operationalize,  however, 
is  that  the  goal  can  be  ill-defined  and  unattainable.  It  is  impossible,  for 
example,  always  to  avoid  taking  points.  Instead,  the  program  must  develop 
approximate  strategies.  The  advice-giver  intends  the  goal  to  suggest,  but  not 
specify,  the  desired  behavior. 

FOO is not able to accomplish this advice-taking task unaided. First,
it does not perform the interpretation step at all but, instead, relies on the
user to translate the English form of the advice into an unambiguous
lambda-calculus representation. Second, FOO cannot perform the operationalization
step  without  human  assistance.  Although  FOO  has  a  large  knowledge  base 
of  transformation  rules  and  an  interpreter  for  applying  those  rules,  it  must 
be  told  by  the  user  which  rules  to  apply.  The  user  must  operate  FOO  by 
repeatedly  selecting  an  appropriate  rule  and  indicating  which  expression  or 
subexpression  should  be  transformed.  Finally,  FOO  does  not  integrate  the 
operational  knowledge  it  develops  into  a  knowledge  base  that  could  drive  a 
Hearts-playing program. No performance element has been developed that
could provide an empirical test of the operationalized knowledge. Despite
these shortcomings, Mostow’s work on FOO provides an in-depth analysis of
the  techniques  required  to  perform  operationalization. 

The  primary  way  in  which  advice  is  operationalized  in  FOO  is  by  applying 
operationalization  methods,  such  as  heuristic  search,  the  pigeonhole  principle, 
and  finding  necessary  or  sufficient  conditions.  Mostow  claims  that  this  is 
precisely  what  knowledge  engineers  and  AI  researchers  do  when  they  are 
faced with a new problem to solve: They look in their bag of tricks for
a  method,  such  as  worst-case  analysis,  that  allows  them  to  construct  an 
effective,  but  inefficient,  program.  This  program  can  then  be  further  refined 
by  applying  other  knowledge  and  advice.  Mostow’s  work  can  thus  be  viewed 
as formalizing the knowledge and techniques used by AI researchers to do
heuristic  programming. 

The  most  sophisticated  of  FOO’s  operationalization  methods  is  the 
heuristic-search  method.  When  FOO  needs  to  evaluate  a  predicate,  such  as 
(TAKE-POINTS ME), over a sequence, such as the sequence of cards in a trick,
it is able to reformulate this problem as a heuristic search of the space of all
possible tricks. FOO starts with a basic generate-and-test algorithm (discussed
in Article II.A, in Vol. I) and refines it into a heuristic search by improving the
ways the algorithm (a) selects the next node to expand, (b) selects possible
expansions of the node to apply, (c) prunes nodes from the search tree, and
(d) prunes possible expansions prior to applying them. The overall effect of
these refinements is to move constraints from the test portion of the algorithm,
that is, the step that checks to see whether the goal has been achieved, into
the generate portion of the algorithm, that is, the step that chooses which
nodes  to  expand  and  how  they  should  be  expanded.  Some  refinements  actu¬ 
ally  move  constraints  out  of  the  search  altogether  by  precompiling  them  into 
tables  or  by  modifying  the  algorithm  to  search  a  smaller  space. 

In  the  “Avoid  taking  points"  problem,  for  example,  FOO  starts  with  a 
simple generate-and-test algorithm that generates all possible tricks and tests
to see if ME (FOO’s performance persona) takes any points. This is gradually
converted into a heuristic search in which the only tricks considered are those
in  which  ME  plays  a  card  higher  than  any  card  played  so  far  in  the  suit 
led.  Additional  heuristics,  such  as  generating  tricks  that  contain  points  first 
and pruning tricks in which the opponents play cards higher than ME, are
extracted from the test and applied earlier in the search to order and prune
the  search  tree. 
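
The idea of moving a constraint from the test into the generator can be made concrete with a toy version of this search. Everything here is a simplified assumption rather than FOO's own procedure: a trick has one card already led, one card from ME, and one card from a single later opponent, and the follow-suit rule is not enforced on the opponent.

```python
from itertools import product

def takes_points(trick, suit_led):
    """Test step: does 'me' win the trick, and does it contain a point card?"""
    in_suit = [(card[0], who) for who, card in trick if card[1] == suit_led]
    has_points = any(c[1] == "hearts" or c == (12, "spades") for _, c in trick)
    return max(in_suit)[1] == "me" and has_points

def search(played, my_cards, unseen):
    """Enumerate completions of the trick; return (number tried, hits)."""
    suit_led = played[0][1][1]
    tried, hits = 0, []
    for mine, later in product(my_cards, unseen):
        tried += 1
        trick = played + [("me", mine), ("opp", later)]
        if takes_points(trick, suit_led):
            hits.append((mine, later))
    return tried, hits

def generate_and_test(played, my_hand, unseen):
    """Plain version: generate every completion, then test each one."""
    return search(played, my_hand, unseen)

def constrained_search(played, my_hand, unseen):
    """Refined version: 'me' can win only with a card in the suit led that
    beats every card so far, so that constraint now filters the generator."""
    suit_led = played[0][1][1]
    best = max(card[0] for _, card in played if card[1] == suit_led)
    candidates = [c for c in my_hand if c[1] == suit_led and c[0] > best]
    return search(played, candidates, unseen)

played = [("lead", (9, "clubs"))]
my_hand = [(4, "clubs"), (11, "clubs"), (14, "clubs")]
unseen = [(2, "hearts"), (13, "clubs")]
tried_plain, hits_plain = generate_and_test(played, my_hand, unseen)
tried_refined, hits_refined = constrained_search(played, my_hand, unseen)
```

Both versions find the same point-taking tricks, but the refined generator never expands the low club, so it examines fewer candidates.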

Underlying  all  of  FOO’s  operationalization  methods  is  its  basic  ability  to 
reformulate  an  expression  in  many  different  ways.  For  example,  in  order  to 
evaluate (VOID P1 S1) (i.e., player P1 is void in suit S1), FOO must reformulate
VOID in terms of observable variables such as the number of cards already
played in the suit S1. In order for FOO to recognize that an operationalization
method is applicable, it must often do some reformulations. Then, in
order  actually  to  apply  the  method,  FOO  may  need  to  do  some  further  refor¬ 
mulations.  The  heuristic  search  method,  for  instance,  is  applicable  only  to 
a  problem  that  is  expressed  as  a  search  through  some  space.  Consequently, 
in  order  to  use  heuristic  search  to  operationalize  the  “Avoid  taking  points” 
advice,  FOO  must  first  reformulate  the  advice  as  a  predicate  over  the  search 
space  of  all  possible  tricks.  The  heuristic  search  can  then  search  this  space 
for  those  tricks  that  do  not  contain  points. 

The reformulation and operationalization process is accomplished by
approximately 200 transformation rules (Mostow, in press). These rules employ
analysis  techniques  and  domain  knowledge  to  successively  reformulate  the 
advice  into  an  operational  form.  In  this  article,  we  trace  a  portion  of  FOO’s 
operationalization  of  the  “Avoid  taking  points"  advice  to  show  how  these 
reformulation  techniques  are  applied.  Before  doing  this,  however,  we  describe 
the  knowledge  that  FOO  has  initially  and  how  it  is  represented. 

FOO’s Initial Knowledge Base

FOO’s performance knowledge is made up of domain concepts, plus rules
and heuristics that are composed in terms of these concepts. The advice
offered to the program likewise consists of domain concepts, plus
compositions of concepts. So as long as these compositions of basic concepts can
be described in general ways, both the performance knowledge and the
advice for building and improving it can be used and manipulated by
domain-independent methods (see Hayes-Roth et al., 1981, for further discussion).

For  example,  in  the  domain  of  the  card  game  Hearts,  basic  concepts 
include: 

deck,  hand,  card,  suit,  spades,  deal,  round,  trick,  avoid,  point, 

player,  play,  take,  lead,  win,  follow  suit. 

Examples  of  advice  in  the  form  of  behavioral  constraints  include: 

The lead of the first trick is by the player with the 2C.

Each player must follow suit if possible.

The  player  of  the  highest  card  in  the  suit  led  wins  the  trick. 

The  winner  of  a  trick  leads  the  next  trick. 

Advice  in  the  form  of  heuristics  includes: 

If the queen of spades has not been played, then flush it out.

Take  all  the  points  in  a  round. 

If you can’t take all the points in a round, then take as few
as  possible. 

If  necessary,  take  a  point  to  prevent  someone  else  from  taking 
them  all. 

A constraint such as “The lead of the first trick is by the player with the 2C”
is represented as a composition, using domain-independent concepts like first
and with and domain-dependent concepts like lead, trick, player, and 2C.

An Example: Operationalizing “Avoid Taking Points”

After advice has been interpreted into an internal representation that is
precise  and  unambiguous,  it  might  be  in  an  operational  form,  for  example, 
“Play  a  low  card."  On  the  other  hand,  it  may  be  far  more  general:  “Avoid 
taking  points.”  Experienced  Hearts  players  will  recognize  that  the  first, 
specific  piece  of  advice  is  a  possible  strategy  for  carrying  out  the  latter,  general 
advice. But it is a rather simplistic strategy, more appropriate for the later
stages  of  a  game  than  for  the  beginning.  Furthermore,  repeated  attempts 
to  play  low  cards  will  sometimes  conflict  with  other  advice.  For  purposes  of 
illustration,  however,  operationalizing  even  a  quite  simple  goal  can  require  a 
wide range of knowledge and methods (see Mostow, 1981; Hayes-Roth et al.,
1981).  For  the  remainder  of  this  article,  several  of  the  methods  and  problems 
of  operationalization  will  be  illustrated  by  showing  how  advice  such  as  this 
can  be  converted  into  directly  executable  procedures. 

First,  consider  how  a  person  might  handle  advice  such  as  “Avoid  taking 
points.” He might apply it to a specific situation by reasoning as follows:

1. To avoid taking points in general, I should avoid taking any points in the
current trick (a single round in which one card is played by each player).

2. Thus, if the trick contains points (either a heart or the queen of spades),
I should try not to win it.

3. I can do this by trying not to play the winning card.

4. That can be done by my playing a card lower than some other card
played in the suit led.

Each  step  above  is  an  attempt  to  implement  the  previous  statement  as  closely 
as  possible  by  restatement  in  successively  more  specific,  operational  terms. 
Some  restatements  may  fully  preserve  the  truth  or  accuracy  of  the  previous 
one, while others may be very suppositional (i.e., valid given certain
assumptions) or more restrictive (i.e., valid only in certain situations). The final
statement  above  is  not  a  very  sophisticated  plan,  but  it  is  at  least  a  reasonable 
operationalization  of  the  initial  advice,  and  it  represents  a  kind  of  process 
that  seems  very  common  in  human  learning.  A  problem-reduction  strategy  is 
employed  until  the  advice  can  be  applied  directly  in  the  given  situation. 

Now that we have a sense of how a person might operationalize “Avoid
taking points,” we trace the methods applied by FOO to accomplish this task.
The  following  example  is  based  on  Derivation  6  in  Mostow  (1981)  in  which 
he  guided  FOO  to  reformulate  “Avoid  taking  points"  as  “Play  a  low  card." 
This  particular  trace  shows  the  use  of  several  simple  operationalization  and 
reformulation  methods  but  does  not  show  the  application  of  the  heuristic- 
search  method  discussed  above. 

To  begin  with,  the  advice  must  be  interpreted  into  a  tractable  repre¬ 
sentational  form,  such  as: 

(avoid (take-points me) (trick))

That is, “Avoid the event in which ME takes points during the current trick.”
In  FOO,  this  is  done  manually  by  the  advice-giver. 

A  useful  beginning  in  operationalization  is  to  elaborate  the  original  advice 
by  expanding  definitions  (first  of  “avoid"  and  then  of  “trick”).  The  point  is  to 
unfold  high-level  terms  so  that  the  expression  can  be  more  easily  manipulated. 
The results are

[achieve (not (during (trick) (take-points me)))]

and

(achieve (not (during [scenario
                        (each p (players) (play-card p))
                        (take-trick (trick-winner))]
                      (take-points me)))).
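
The two expansions just shown can be mimicked with a small term-rewriting sketch. Expressions are written as nested Python tuples, and the definitions table, the function names, and the `unfold` helper are hypothetical stand-ins, not FOO's transformation rules.

```python
def expand_avoid(expr):
    # (avoid EVENT SITUATION) -> (achieve (not (during SITUATION EVENT)))
    _, event, situation = expr
    return ("achieve", ("not", ("during", situation, event)))

def expand_trick(expr):
    # (trick) -> a two-part scenario: each player plays, then the winner takes.
    return ("scenario",
            ("each", "p", ("players",), ("play-card", "p")),
            ("take-trick", ("trick-winner",)))

# Hypothetical definitions table: head symbol -> function building its expansion.
DEFINITIONS = {"avoid": expand_avoid, "trick": expand_trick}

def unfold(expr, head):
    """Expand every subexpression whose head symbol is `head`."""
    if not isinstance(expr, tuple) or not expr:
        return expr
    expr = tuple(unfold(part, head) for part in expr)
    return DEFINITIONS[head](expr) if expr[0] == head else expr

advice = ("avoid", ("take-points", "me"), ("trick",))
step1 = unfold(advice, "avoid")
step2 = unfold(step1, "trick")
```

Applying the two definitions in turn reproduces the shape of the two expressions above: first the `achieve`/`not`/`during` form, then the same form with `(trick)` replaced by its scenario.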

The  advice  in  this  form  is  still  not  operational,  since  it  depends  on  the 
outcome  of  the  trick,  which  is  not  generally  knowable  at  the  time  ME  needs 
to  choose  an  action  in  accordance  with  the  advice.  Therefore,  a  case  analysis 
is done on the subexpression (during ...). The idea is to reformulate a single
concept as several disjoint expressions that can be evaluated separately. To
this end, the single (during ...) expression is split into two expressions that
depend on alternative assumptions. Here, taking points during the two-part
“scenario” above can be considered as either of two possible cases: that taking
points  occurs  during  (a)  the  playing  of  cards  or  (b)  the  taking  of  the  trick. 
The  transformation  results  in: 

(achieve (not (or [during (each p (players) (play-card p))
                          (take-points me)]
                  (during (take-trick (trick-winner))
                          (take-points me))))).
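
A minimal sketch of this case analysis, again using nested tuples for expressions: a `during` over a multi-part scenario is split into one `during` per part, joined by `or`. The `split_during` helper is an assumed name, and the sketch handles only this one pattern.

```python
def split_during(expr):
    """(during (scenario A B ...) E) -> (or (during A E) (during B E) ...)."""
    if not isinstance(expr, tuple) or not expr:
        return expr
    expr = tuple(split_during(part) for part in expr)
    if expr[0] == "during" and isinstance(expr[1], tuple) and expr[1][0] == "scenario":
        steps, event = expr[1][1:], expr[2]
        return ("or",) + tuple(("during", step, event) for step in steps)
    return expr

play = ("each", "p", ("players",), ("play-card", "p"))
take = ("take-trick", ("trick-winner",))
advice = ("achieve", ("not", ("during", ("scenario", play, take),
                              ("take-points", "me"))))
cases = split_during(advice)
```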

The  next  transformation  eliminates  impossible  cases.  When  expressions 
cannot be achieved because of impossible conditions, the learner should
recognize this and drop them from consideration. Here, the first case can be ignored
because there is no way to take points during the play of the cards (it is
possible only after all players have played, when the trick is taken). FOO
recognizes this by an intersection search. It searches through the knowledge
base of defined concepts for a common subevent of the two events (each p
(players) (play-card p)) and (take-points me). Since no common subevent
is found for these two, FOO concludes that the situation is an impossible one
and eliminates it. (For the second case, take-trick and take-points have a
common subevent, take.) The advice now is:

(achieve (not (during (take-trick (trick-winner))
                      (take-points me)))).
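
The intersection search can be sketched as a reachability check over a table of concept definitions. The table below is a hypothetical fragment: only the fact that take-trick and take-points share the subevent take comes from the text, and the `play-cards` entry is invented to stand for the card-playing event.

```python
# Hypothetical fragment of a concept table: each event maps to the events
# that appear in its definition.
DEFINED_IN_TERMS_OF = {
    "play-cards": ["play-card"],
    "take-trick": ["take"],
    "take-points": ["take"],
}

def subevents(event, table):
    """All events reachable from `event` through the table, itself included."""
    seen, frontier = set(), [event]
    while frontier:
        current = frontier.pop()
        if current not in seen:
            seen.add(current)
            frontier.extend(table.get(current, []))
    return seen

def common_subevents(a, b, table):
    """Intersect the two reachable sets; an empty result marks an impossible case."""
    return subevents(a, table) & subevents(b, table)

first_case = common_subevents("play-cards", "take-points", DEFINED_IN_TERMS_OF)
second_case = common_subevents("take-trick", "take-points", DEFINED_IN_TERMS_OF)
```

An empty intersection for the first case corresponds to dropping that disjunct, while the shared `take` in the second case is what keeps it under consideration.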

The  advice  is  still  far  from  operational.  One  difficulty  is  that  neither 
take-trick nor trick-winner is immediately evaluable at the time a card must
be chosen for play. At this point, the problem can be reduced by reexpressing
different concepts in common terms. This is possible here by again elaborating
definitions and restructuring the subexpressions. Since take-points occurs
during take-trick, the expression can be reformulated as:

(achieve (not (exists c1 (cards-played)
                (exists c2 (point-cards)
                  (during (take (trick-winner) c1)
                          (take me c2)))))).

This says, “Make sure the situation does not happen where ME takes a point
card  (c2)  during  the  time  that  the  winner  of  the  trick  takes  the  cards  played.” 

A process of partial matching recognizes that the two events in the during
subexpression are closely related and thus are candidates for simplification,
depending on the constraints of the during predicate. Using domain knowledge
of relationships among the concepts, the terms can be combined and the
subexpression  made  less  complex.  Instead  of  the  complicated  relation  during, 
the  events  become  joined  by  the  far  simpler  predicates  =  and  and.  We  now 
have: 

(achieve (not (exists c1 (cards-played)
                (exists c2 (point-cards)
                  (and (= (trick-winner) me) (= c1 c2)))))).

Further  analysis  at  this  point  shows  that  simplification  of  some  forms  is 
possible.  The  central  purpose  of  searching  for  simplifications  is  to  restructure 
expressions  to  make  them  more  amenable  to  further  analysis.  Examples  of 
simplifying  methods  are  deleting  null  clauses  from  a  disjunction,  transforming 
an  expression  into  a  constant  (by  evaluation),  applying  logical  transformations 
(such as De Morgan’s laws), or removing quantifiers when possible. The last
of these methods is appropriate here, since c1 and c2 denote the same object:
a  point  card.  Thus  with  some  reformulation  employing  domain  knowledge, 
one  variable  can  be  replaced  by  the  other,  and  the  condition  that  they  be 
equal  can  be  dropped.  The  expression  is  transformed  into: 

(achieve (not (and (= (trick-winner) me)
                   (exists c1 (cards-played)
                     (in c1 (point-cards)))))).

Another kind of pattern-matching can accomplish another kind of
simplification: By looking for canonical constructions, the operationalizer can
recognize known concepts. If the form of a lower level expression fits the
definition  of  a  higher  level  concept,  the  former  can  be  replaced  by  its  simpler 
equivalent.  (Note  that  this  is  the  inverse  of  the  first  transformation  mentioned 
above:  expanding  definitions.)  In  this  case,  the  last  two  lines  of  the  above 
expression  match  the  definition  of  trick-has-points.  This  is  analogous  to  the 
psychological  process  of  chunking.  In  addition  to  all  the  analytical  advantages 
gained  by  simplification,  the  recognition  of  known  concepts  can  also  enable 
the  application  of  previously  learned  knowledge  about  them  (e.g.,  ways  to 
predict  the  likelihood  that  a  trick  will  have  points  in  it).  Our  expression  is 
now  reduced  to  not  winning  a  trick  that  has  points: 

(achieve (not (and (= (trick-winner) me) [trick-has-points]))).
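
Recognizing a known concept is the inverse of expanding a definition, and a rough sketch looks like this. The stored body for trick-has-points and the `fold` helper are assumptions for illustration; the match is purely structural, so unlike a real matcher it would miss an instance that used a different variable name.

```python
# Assumed stored definition of the concept trick-has-points.
TRICK_HAS_POINTS_BODY = ("exists", "c1", ("cards-played",),
                         ("in", "c1", ("point-cards",)))

def fold(expr, body, name):
    """Replace every subexpression equal to `body` by the concept (name)."""
    if expr == body:
        return (name,)
    if isinstance(expr, tuple):
        return tuple(fold(part, body, name) for part in expr)
    return expr

before = ("achieve", ("not", ("and", ("=", ("trick-winner",), "me"),
                              TRICK_HAS_POINTS_BODY)))
after = fold(before, TRICK_HAS_POINTS_BODY, "trick-has-points")
```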

The  expression  is  still  not  operational,  since  trick-winner  is  not  generally 
knowable at the time of choosing which card to play. The concept of
trick-winner is further analyzed, and, in fact, it takes about 20 further
transformations to reformulate the above expression, “Try not to win a trick that has
points,” into “If you’re following suit in a trick with points, try to play lower
than some other card played in the suit led.” Symbolically, this looks like:

(achieve (=> [and (in-suit-led (card-of me))
                  (trick-has-points)]
             [lower (card-of me)
                    (find-element (cards-played-in-suit-led))])).

But this still is not operational, since in general the set
cards-played-in-suit-led is not fully known at the time that ME must choose a card. Since
Hearts is a game of imperfect information, this set cannot generally be known,
but the data available (cards already played) can be used to approximate the
result. Here, the binary relation lower is approximated by the unary predicate
low. In other words, in the absence of complete information for evaluating a
comparative predicate (lower x1 x2), use instead an estimating function (low
x1) that may not be exact but can produce a result from the available data.
The  approximation  is: 

(achieve (=> (and (in-suit-led (card-of me))
                  (trick-has-points))
             [low (card-of me)])).

This is now very close to being operational. Low is an imprecise term but
can be treated as a fuzzy predicate (see Zadeh, 1970); that is, it could be
used to order potential candidates for the choice variable, card-of me.

The only remaining barrier to full operationality is the predicate
(trick-has-points). This also is not always knowable at the time of choosing a
card to play. However, further analysis leads to application of a rule that
formulates an assertion as possible (effectively assuming it to be true) in the
absence of any knowledge to the contrary. Even when a predicate p is not
evaluable, (possible p) will be.

Thus,  the  fully  operational  (though  approximate)  reformulation  of  the 
original  “Avoid  taking  points"  is  “If  following  suit  in  a  trick  that  may  have 
points,  play  a  low  card.”  Again,  the  result  may  not  always  be  the  most  effective 
action  and  may  be  in  conflict  with  other  advice.  These  are  issues  to  be  decided 
by the evaluating module of the learning element and by the performance
element  of  the  program.  The  symbolic  form  of  the  operationalized  advice  is: 

(achieve (=> [and (in-suit-led (card-of me))
                  (possible (trick-has-points))]
             [low (card-of me)])).
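
As a rough illustration only, the operationalized advice can be sketched in
Python. This is not FOO's code: the rank ordering, the linear `low` score, the
`choose_card` helper, and the use of `None` for "unknown" are all assumptions
made for this sketch. It shows `low` used as a fuzzy score that orders the
candidate cards, and `possible` treating an assertion as true unless it is
known to be false.

```python
# Hypothetical sketch of the operationalized advice; not FOO's actual code.
RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]


def low(card):
    """Fuzzy degree (0.0 to 1.0) to which a card counts as low (assumed linear)."""
    rank, _suit = card
    return 1.0 - RANKS.index(rank) / (len(RANKS) - 1)


def possible(known_value):
    """Assume an assertion holds unless it is known to be false (None = unknown)."""
    return known_value is not False


def choose_card(hand, suit_led, trick_has_points=None):
    """If following suit in a trick that may have points, play the lowest card."""
    in_suit = [card for card in hand if card[1] == suit_led]
    if in_suit and possible(trick_has_points):
        return max(in_suit, key=low)  # order the candidates by the fuzzy score
    return None  # the advice does not apply; other advice must decide


hand = [("K", "hearts"), ("4", "hearts"), ("9", "clubs")]
print(choose_card(hand, "hearts"))  # ('4', 'hearts')
```

When the trick is known to contain no points, or when ME cannot follow suit,
the sketch returns `None`, leaving the choice to other advice.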

Conclusion 

The example given above is a useful one because of the diversity of its
reformulations, not because of any completeness. Among the most useful
contributions of this research has been an introduction to the considerable
complexity of operationalizing advice. Of the 13 examples of operationalized
advice given in Mostow's thesis (1981), a couple required only a handful of
transformations (a minimum of 8), but several required over 100. About 10
domain-independent transformational rules were mentioned in the example
above, but over 200 such rules have been formulated and included in the
system. Mostow (1981) gives a taxonomy of operationalization methods according
to their purpose, scope, and accuracy. This taxonomy is outlined in
Figure C2-1; each category is illustrated by one or more methods.

The  greatest  shortcoming  of  the  work  on  FOO  is  the  lack  of  a  control 
structure  that  could  apply  these  operationalization  methods  automatically. 
The  development  of  such  a  control  regime  may  be  quite  difficult.  Mostow 
suggests using means-ends analysis (see Article II.D2, in Vol. I) and describes
how  his  execution  of  rules  often  conformed  to  the  following  pattern: 

1. Reformulate an expression until it is possible to
2. recognize that the method is applicable and decide to apply it, so
3. reformulate the expression to match the method problem statement and
4. fill in additional information required by the method; then
5. refine the instantiated method by applying additional domain knowledge.

A second shortcoming of FOO is that its methods are quite specific to the
game of Hearts and similar tasks. The development of a general-purpose
operationalization program will require the explication of many more
operationalization methods. Still, these first steps in operationalization
should prove valuable either for the overall project of machine-aided heuristic
programming (see the beginning of this article) or for future efforts at
implementing advice-taking systems.

1. Methods for evaluating an expression
   a. Procedures that always produce a result (assuming their inputs are
      available)
        “Pigeonhole principle”
        “Historical reasoning”
        “Heuristic search”
   b. Procedures that sometimes produce a result
        “Check a necessary or sufficient condition”
        “Make a simplifying assumption that restricts the scope of
         applicability”
   c. Procedures that produce an approximate result
        “Apply formula for probability that randomly chosen subsets overlap”
        “Characterize a quantity as an increasing or decreasing function of
         some variable”
        “Use an untested simplifying assumption”
        “Predict others' choices pessimistically”

2. Methods for achieving a goal
   a. Sound methods (introduce no errors): execution of plan (when feasible)
      will achieve goal
        “To empty a set, remove one element at a time”
        “Find a sufficient condition and achieve it”
        “Restrict a choice to satisfy the goal”
        “Modify a plan for one goal to achieve an additional goal”
        “To achieve a goal with a future deadline, satisfy it now and then
         avoid violating it”
   b. Heuristic methods: execution of plan may not always achieve goal
        “Simplify the goal by arbitrarily choosing a value for one of its
         variables”
        “Find a necessary condition and achieve it”
        “Order choice set with respect to goal”


Figure C2-1. Taxonomy of operationalization methods.

Mostow (1981) is the most comprehensive description of FOO. The articles
by Hayes-Roth, Klahr, and Mostow (1980, 1981) and by Hayes-Roth,
Klahr, Burge, and Mostow (1978) provide a good overview of the idea of
machine-aided heuristic programming. Mostow (in press) describes the work
on heuristic search.


D.  LEARNING  FROM  EXAMPLES 


D1. Issues


THE PROSPECT of creating a program that can learn from examples has
attracted the attention of AI researchers since the 1950s. McCarthy (1958,
p. 78) said, “Our ultimate objective is to make programs that learn from their
experience as effectively as humans do.” Of course, the attainment of this goal
still lies in the distant future. The area of learning from examples is, however,
the best understood aspect of learning.

A program that learns from examples must reason from specific instances
to general rules that can be used to guide the actions of the performance
element. The learning element is presented with very low level information,
in the form of a specific situation and the appropriate behavior for the
performance element in that situation, and it is expected to generalize this
information to obtain general rules of behavior.

Consider, for example, a program that is learning to play checkers. One
way to train the program is to present it with particular checkers-board
situations and tell it what the best moves are. The program must generalize
from these particular moves to discover strategies for good play. Similarly, if
we are teaching a program the concept of a dog, for example, we might present
the program with various animals (and other things) and tell it whether or
not they are dogs. The program must develop general rules for recognizing
dogs and distinguishing them from everything else in the world.

Simon and Lea (1974), in an important early paper on induction, describe
the problem of learning from examples as the problem of using training
instances, selected from some space of possible instances, to guide a search for
general rules. They call the space of possible training instances the instance
space and the space of possible general rules the rule space. Furthermore,
Simon and Lea point out that an intelligent program might select its own
training instances by actively searching the instance space in order to resolve
some ambiguity about the rules in the rule space. Thus, if the program were
unsure whether all dogs have four legs, it might search the instance space for
animals with different numbers of legs to see which ones are dogs. Simon and
Lea view a learning system as moving back and forth between an instance
space and a rule space until it has converged on the desired rule.

This two-space view of learning from examples as a simultaneous, cooperative
search of the instance space and the rule space is a good perspective for
organizing this article. We will use the terms instance space and rule space
even in situations where the rule space does not contain rules but, instead,
contains some other high-level descriptions of the knowledge needed by the
performance element.

Figure D1-1 shows a schematic diagram of the two-space model of learning
from examples. In addition to the instance space and the rule space, the
processes of interpretation and experiment planning are depicted. In some
learning situations, the training instances are provided in a form far removed
from the form of the rules in the rule space. As a result, when the program
moves from the instance space to the rule space, special processes are needed
to interpret the raw training instances so that they can guide the search of the
rule space. Similarly, when the program needs to gather some new training
instances, special experiment-planning routines are needed so that the current
high-level hypotheses can guide the search of the instance space.

As  an  example  of  the  two-space  model,  consider  the  problem  of  teaching 
a  computer  program  the  concept  of  a  flush  in  poker  (i.e.,  a  hand  in  which  all 
five  cards  have  the  same  suit).  The  instance  space  in  this  learning  problem  is 
the  space  of  all  possible  poker  hands.  We  can  represent  an  individual  point 
in this space as a set of five ordered pairs, for example,

{(2, clubs), (3, clubs), (5, clubs), (jack, clubs), (king, clubs)}.

Each  ordered  pair  specifies  the  rank  and  suit  of  one  of  the  cards  in  the  hand. 
The  entire  instance  space  is  the  space  of  all  such  five-card  sets. 
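
As a small illustration, the instance space can be written down directly in
Python. The names `RANKS`, `SUITS`, `DECK`, and `is_flush` are introduced only
for this sketch; the representation itself (a set of five rank-suit pairs) is
the one described above.

```python
from math import comb

RANKS = ["ace", "2", "3", "4", "5", "6", "7", "8", "9", "10",
         "jack", "queen", "king"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = [(rank, suit) for rank in RANKS for suit in SUITS]

# One point in the instance space: a set of five (rank, suit) ordered pairs.
hand = frozenset({("2", "clubs"), ("3", "clubs"), ("5", "clubs"),
                  ("jack", "clubs"), ("king", "clubs")})


def is_flush(hand):
    """The teacher's classification: all five cards share one suit."""
    return len(hand) == 5 and len({suit for _rank, suit in hand}) == 1


# The whole instance space is every five-card subset of the deck.
instance_space_size = comb(len(DECK), 5)
print(is_flush(hand), instance_space_size)  # True 2598960
```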

The rule space in this problem could be the space of all predicate calculus
expressions composed of the predicates SUIT and RANK; the variables
$c_1, c_2, c_3, c_4, c_5$ for the cards; any necessary free variables; the
constant values of clubs, diamonds, hearts, spades, ace, 2, 3, 4, 5, 6, 7, 8,
9, 10, jack, queen, and king; the conjunction operator ($\land$); and the
existential quantifier ($\exists$). This rule space includes concepts such as
contains at least three cards of the same rank:

$\exists\, c_1, c_2, c_3 : \mathrm{RANK}(c_1, x) \land \mathrm{RANK}(c_2, x) \land \mathrm{RANK}(c_3, x)$,

and also the desired concept of a flush:

$\exists\, c_1, c_2, c_3, c_4, c_5 : \mathrm{SUIT}(c_1, x) \land \mathrm{SUIT}(c_2, x) \land \mathrm{SUIT}(c_3, x) \land \mathrm{SUIT}(c_4, x) \land \mathrm{SUIT}(c_5, x)$.

Note  that  this  rule  space  does  not  contain  the  concept  of  a  straight. 

A learning program for searching these two spaces might operate as
follows. First, the program selects a training instance from the instance
space and asks the teacher whether it is an instance of the desired concept.
This information (the instance and its classification) is converted by the
interpretation procedures into a form that can help guide the search of the
rule space. When some plausible candidate concepts are found in the rule
space, experiment-planning routines decide which training instances should
be examined next. If the learning program works properly, it will eventually
choose, as its best candidate concept, the flush concept shown above.

Learning  systems  that  employ  the  two-space  approach  are  making  use 
of  the  closed-world  assumption,  that  is,  the  assumption  that  the  rule  space 
contains  the  desired  concept.  The  closed-world  assumption  allows  programs 
to  locate  the  desired  concept  by  progressively  excluding  candidate  concepts 
that  are  known  to  be  incorrect. 
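
The idea of locating a concept by exclusion can be sketched as follows. The
three-concept rule space and the helper names are assumptions chosen for
illustration; a real rule space would be far larger and would not be
enumerated explicitly.

```python
def all_same_suit(hand):
    return len({suit for _rank, suit in hand}) == 1


def has_pair(hand):
    ranks = [rank for rank, _suit in hand]
    return len(set(ranks)) < len(ranks)  # some rank occurs at least twice


def all_clubs(hand):
    return all(suit == "clubs" for _rank, suit in hand)


# A toy rule space that is assumed to contain the desired concept.
RULE_SPACE = {"flush": all_same_suit, "pair": has_pair, "all clubs": all_clubs}


def eliminate(candidates, hand, is_positive):
    """Keep only the candidates that classify the hand as the teacher did."""
    return {name: rule for name, rule in candidates.items()
            if rule(hand) == is_positive}


training = [
    ([("2", "clubs"), ("3", "clubs"), ("5", "clubs"),
      ("jack", "clubs"), ("king", "clubs")], True),
    ([("2", "spades"), ("4", "spades"), ("7", "spades"),
      ("9", "spades"), ("ace", "spades")], True),
]

remaining = dict(RULE_SPACE)
for hand, label in training:
    remaining = eliminate(remaining, hand, label)
print(sorted(remaining))  # ['flush']
```

If the closed-world assumption fails (the target is not in `RULE_SPACE`), this
procedure can end with no candidates at all, which is one way such a failure
shows up.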

This  two-space  view  of  learning  from  examples  helps  to  elucidate  many  of 
the  design  issues  for  learning  systems.  In  this  article,  we  follow  this  two-space 
model  full  circle.  We  examine,  in  turn,  the  issues  concerning  the  instance 
space,  the  interpretation  process,  the  rule  space,  and  the  experiment-planning 
process. 

Instance  Space 

The first issue involving the instance space is the quality of the training
instances. High-quality training instances are unambiguous and thus
provide reliable guidance to the search of the rule space. Low-quality training
instances invite multiple, conflicting interpretations and, consequently,
provide only tentative guidance to the rule-space search.

Consider  again  the  problem  of  teaching  a  program  the  concept  of  a  flush. 
There  are  several  sources  of  ambiguity  that  could  make  it  difficult  for  the 
program  to  discover  the  concept  from  training  instances. 

First, the instances may contain errors. If the descriptions of the
instances are incorrect, for example, if a 2 of clubs is incorrectly observed to
be a 2 of spades, the error is a measurement error. If, on the other hand, the
classification of the hand (as being a flush or not being a flush) is incorrect,
the error is a classification error. Two kinds of classification errors can occur.
The program can be told that a sample hand is a flush when in fact it is
not (a false positive instance) or that it is not a flush when in fact it is (a
false negative instance).

A second source of ambiguity arises if the program must learn from
unclassified training instances. In these so-called unsupervised learning
situations, the program is given heuristic information that it must use to
classify the training instances itself. If this heuristic knowledge is weak and
imperfect, the rule-space search must treat the resulting classifications as
being potentially incorrect.

A third factor relating to the quality of the training instances is the
order in which they are presented. A good training sequence systematically
varies the relevant features to determine which features are important. When
a program is selecting training instances, it attempts to construct a good
training sequence for itself. The task of learning is made much easier if there
is a teacher who can be counted on to perform this function. In such cases,
a program can reason about a puzzling instance by trying to infer “what the
teacher was getting at” in presenting the example.

The main point, then, is that high-quality training instances are
unambiguous. Under such favorable conditions, the program can be designed to
embody a whole set of constraining assumptions about the examples that
permit it to locate rapidly the appropriate high-level rule in the rule space.
Low-quality instances, again, are ambiguous, because the program must
consider a much larger space of hypotheses. Thus, if it is possible that the
training instances contain errors, the program must consider the hypothesis
that any given instance is incorrect due to either measurement error or
classification error. In general, the more constraints a program can assume
about the data, the more easily it can learn from them.

The second design issue concerning the instance space is the question of
how it should be searched. This issue has not received much attention in AI
research, since most work has assumed either that the instances are presented
all at once or else that the program has no control over their selection. (See,
however, Rissland and Soloway, 1980, for recent work on instance selection.)
Programs that can update their hypotheses as additional training instances
are selected (or are made available by the environment) are said to perform
incremental learning. Programs that explicitly search the instance space are
said to perform active instance selection.

Most methods of searching the instance space make use of a set, H, of
hypotheses in the rule space that are currently believed by the program to be
most plausible. One approach is to try to discriminate as much as possible
among the alternatives within H. A training instance can be chosen that
“splits H in half,” so that half of the hypotheses can be ruled out when
the new instance is obtained. Another approach is to choose the most likely
hypothesis in H and try to confirm it by checking additional training instances
(particularly instances with extreme characteristics). Using a confirmatory
strategy, the learning system can determine the limits of applicability of the
hypothesis under consideration. A third approach, called expectation-based
filtering, selects training instances that contradict the hypotheses in H (see
Lenat, Hayes-Roth, and Klahr, 1979). The hypotheses in H are used to
filter out those instances that are expected to be true (i.e., those that are
consistent with H), so that the learning program can focus its attention
on those instances in which its current hypotheses break down. Finally, an
important consideration may be the size of H, or other computational costs
associated with the learning process. In such cases, new instances may be
selected to minimize these computational costs. For example, the program
might try to rule out only one factor at a time in order to reduce the cost of
comparing a drastically different training instance with each hypothesis in H.
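
The first strategy, choosing an instance that splits H in half, can be
sketched as below. The example returns to the question of how many legs a dog
has; the four hypotheses, the candidate pool, and the function name
`best_split` are all invented for this illustration.

```python
def best_split(hypotheses, pool):
    """Pick the instance on which the hypotheses disagree most evenly."""
    half = len(hypotheses) / 2

    def imbalance(instance):
        positives = sum(1 for h in hypotheses if h(instance))
        return abs(positives - half)

    return min(pool, key=imbalance)


# Toy hypothesis set H: each hypothesis maps a leg count to "is a dog".
H = [
    lambda legs: legs == 4,
    lambda legs: legs >= 4,
    lambda legs: legs % 2 == 0,
    lambda legs: legs >= 2,
]
pool = [2, 3, 4, 6]  # leg counts of animals the program could ask about

print(best_split(H, pool))  # 2
```

Whatever the teacher answers about a two-legged animal, two of the four
hypotheses are ruled out. An animal with four legs would be uninformative
here, since every hypothesis in H predicts the same answer for it.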

Interpretation  Processes 

Once the training instances have been selected, they may need to be
transformed before they can be used to guide the search of the rule space. This
transformation process can be quite difficult, especially in perceptual learning
tasks. Suppose, for example, that we wish to train a computer to recognize
the concept of an arch constructed from toy blocks. The program will be
presented with a line drawing of a scene involving a structure of blocks and
told whether or not the scene contains an arch. Winston's (1970) program that
solves this learning task (see Article XIV.D3a) makes extensive use of
“blocks-world knowledge” to interpret the line drawing and extract a relational
graph structure that indicates which blocks are resting on top of which other
blocks, which blocks are touching, and so forth. These are the relations needed
to express the concept of an arch.

Another learning program that performs extensive interpretation of the
training instances is Soloway's (1978) BASEBALL system. The raw training
instances are roughly 2,000 noise-free “snapshots” of a baseball game. The
snapshots give the locations of the nine players on the two teams (e.g., (AT P1
FIRST-BASE)), the location of the ball, and the state of the scoreboard. The
program is composed of a sequence of nine steps that employ various kinds of
knowledge to interpret and generalize the training instances. The first three
steps apply general knowledge about games to filter out periods of inactivity
and focus on cycles of high activity. The next three steps apply knowledge
about physics and about competition and cooperation to interpret these cycles
of activity as competitive or cooperative episodes. To identify these episodes,
the program must assign goals to the different players (e.g., (WANT-TO-EXECUTE
(AT P1 FIRST-BASE))). It also guesses that the overall goal of an episode is
that of the last action taken by a player. The final three steps search the
rule space to discover generalized episodes and episode goals such as hit and
out. These concepts are far removed from the original training instances,
but because the previous steps have properly interpreted the data in terms of
goals and actions, this rule-space search is easily accomplished.

The basic purpose of interpreting the training instances is to extract
information  that  is  useful  for  guiding  the  search  of  the  rule  space.  This  usually 
involves  converting  the  raw  training  instances  into  a  representational  form 
that  allows  syntactic  generalization  to  be  easily  accomplished  (see  below). 

Rule  Space 

Two  main  issues  are  related  to  the  rule  space  of  high-level  knowledge: 
What  is  the  space,  and  how  can  it  be  searched?  The  rule  space  is  usually 
defined  by  specifying  the  kinds  of  operators  and  terms  that  can  be  used  to 
represent a rule. The designer of a learning system seeks to choose a rule
space  that  is  easy  to  search  and  that  contains  the  desired  rule  or  rules.  In  the 
sections  that  follow,  we  first  discuss  two  factors  that  influence  the  choice  of  a 
representation  language  for  the  rule  space:  the  kinds  of  inference  supported 
by  the  representation  and  the  single-representation  trick.  Then  we  survey 
the  four  methods  for  searching  the  rule  space.  We  conclude  the  discussion  of 
rule-space  issues  by  examining  problems  that  arise  when  the  representation  is 
found  to  be  inadequate  for  expressing  the  desired  rule  or  rules. 

Syntactic rules of inference. Both the expressiveness of a representation
and the ease of searching the rule space depend on the kind and
complexity of the inferences supported by the representation. The most
common inference process needed for learning from examples is generalization.
We say that one description, A, is more general than another description, B,
if A applies in all of the situations in which B applies and then some more.
Thus, the set of situations in which A is relevant is a superset of the set of
situations in which B is relevant. For example, the rule that All ravens are
black is more general than the rule that All one-eyed ravens are black, since
the set of all ravens strictly includes the set of one-eyed ravens. Often, a
description A is more general than a description B because A places fewer
constraints on any relevant situations. The all ravens rule omits the one-eyed
constraint and, hence, is more general.
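
Over a finite collection of situations, the more-general-than relation is a
proper-superset test on the sets of situations each description covers. The
sketch below assumes descriptions are Python predicates and situations are
(species, number of eyes) pairs; both choices, and the function names, are
illustrative only.

```python
def extension(description, situations):
    """The set of situations in which a description applies."""
    return {s for s in situations if description(s)}


def more_general(a, b, situations):
    """True if a applies wherever b applies, and in at least one more case."""
    return extension(a, situations) > extension(b, situations)


situations = [("raven", 2), ("raven", 1), ("crow", 2)]


def is_raven(s):
    return s[0] == "raven"


def is_one_eyed_raven(s):
    return s[0] == "raven" and s[1] == 1


print(more_general(is_raven, is_one_eyed_raven, situations))  # True
print(more_general(is_one_eyed_raven, is_raven, situations))  # False
```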

It is important to choose a representation for the rule space in which
generalization can be accomplished by inexpensive syntactic operations.
Predicate calculus, for example, is quite amenable to certain kinds of syntactic
generalization. Below are some examples of syntactic rules of inference that
accomplish generalization in predicate calculus. Some recent work in learning
(Larson, 1977; Larson and Michalski, 1977; Michalski, 1980) has sought to
identify rules of inference that are particularly useful in learning systems. It
is important to note that these rules of inference do not preserve truth; the
rules are inductive.

1. Turning constants to variables. Suppose we want a program to
discover the concept of a flush in poker. We might give some training
instances of the form:

Instance 1. $\mathrm{SUIT}(c_1, \mathit{clubs}) \land \mathrm{SUIT}(c_2, \mathit{clubs}) \land \mathrm{SUIT}(c_3, \mathit{clubs}) \land \mathrm{SUIT}(c_4, \mathit{clubs}) \land \mathrm{SUIT}(c_5, \mathit{clubs}) \Rightarrow \mathrm{FLUSH}(c_1, c_2, c_3, c_4, c_5)$.

Instance 2. $\mathrm{SUIT}(c_1, \mathit{spades}) \land \mathrm{SUIT}(c_2, \mathit{spades}) \land \mathrm{SUIT}(c_3, \mathit{spades}) \land \mathrm{SUIT}(c_4, \mathit{spades}) \land \mathrm{SUIT}(c_5, \mathit{spades}) \Rightarrow \mathrm{FLUSH}(c_1, c_2, c_3, c_4, c_5)$.

From these, the program could hypothesize the rule

Rule 1. $\mathrm{SUIT}(c_1, x) \land \mathrm{SUIT}(c_2, x) \land \mathrm{SUIT}(c_3, x) \land \mathrm{SUIT}(c_4, x) \land \mathrm{SUIT}(c_5, x) \Rightarrow \mathrm{FLUSH}(c_1, c_2, c_3, c_4, c_5)$.

by replacing the atomic constants of clubs and spades by the variable x
(where x stands for any suit).
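
A minimal sketch of this rule of inference follows, assuming each instance is
a list of (predicate, card, value) tuples and that the two instances are
already aligned literal by literal. The tuple encoding, the `?x0` variable
naming, and the function `variabilize` are assumptions of the sketch, not
part of any system described here.

```python
def variabilize(inst1, inst2):
    """Replace constants that differ between two aligned instances by variables.

    The same pair of differing constants always maps to the same variable.
    """
    mapping = {}  # (constant in inst1, constant in inst2) -> variable name
    rule = []
    for lit1, lit2 in zip(inst1, inst2):
        new_literal = []
        for a, b in zip(lit1, lit2):
            if a == b:
                new_literal.append(a)
            else:
                new_literal.append(
                    mapping.setdefault((a, b), "?x%d" % len(mapping)))
        rule.append(tuple(new_literal))
    return rule


instance1 = [("SUIT", "c%d" % i, "clubs") for i in range(1, 6)]
instance2 = [("SUIT", "c%d" % i, "spades") for i in range(1, 6)]

rule1 = variabilize(instance1, instance2)
print(rule1[0])  # ('SUIT', 'c1', '?x0')
```

Because clubs and spades are replaced by one shared variable, the resulting
rule still requires all five cards to have the same suit.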

2. Dropping conditions. Suppose again that we are teaching a program
the concept of a flush, but now we present instances of the form:

Instance 1. $\mathrm{SUIT}(c_1, \mathit{clubs}) \land \mathrm{RANK}(c_1, 3) \land \mathrm{SUIT}(c_2, \mathit{clubs}) \land \mathrm{RANK}(c_2, 5) \land \mathrm{SUIT}(c_3, \mathit{clubs}) \land \mathrm{RANK}(c_3, 7) \land \mathrm{SUIT}(c_4, \mathit{clubs}) \land \mathrm{RANK}(c_4, 10) \land \mathrm{SUIT}(c_5, \mathit{clubs}) \land \mathrm{RANK}(c_5, \mathit{king}) \Rightarrow \mathrm{FLUSH}(c_1, c_2, c_3, c_4, c_5)$.

In order to discover rule 1, the program must not only turn constants
into variables, but it must also “forget” all of the RANK predicates, since
rank is irrelevant. This can be accomplished by dropping conditions. Any
conjunction can be generalized by dropping one of its conditions. We can
view a conjunctive condition as a constraint on the set of possible instances
that could satisfy the description. By dropping a condition, we are removing
a constraint and generalizing the rule.
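
Dropping conditions amounts to removing literals from a conjunction. The
sketch below reuses a tuple encoding of literals, which is an assumption of
this example, and drops every RANK condition from the instance above.

```python
# The instance above as a list of (predicate, card, value) literals.
instance = []
for i, rank in enumerate(["3", "5", "7", "10", "king"], start=1):
    card = "c%d" % i
    instance.append(("SUIT", card, "clubs"))
    instance.append(("RANK", card, rank))


def drop_conditions(conjunction, predicate):
    """Generalize a conjunction by removing every literal with this predicate."""
    return [literal for literal in conjunction if literal[0] != predicate]


generalized = drop_conditions(instance, "RANK")
print(len(instance), len(generalized))  # 10 5
```

The generalized conjunction is a proper subset of the original one, so any
hand that satisfied the original also satisfies the generalized description.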

3. Adding options. A further way to generalize a rule is to add another
option to the rule so that more instances may conceivably satisfy it. Suppose
we are trying to teach a program the concept of a face card (i.e., jack, queen,
or king). We might give examples of the form:

Instance 1. $\mathrm{RANK}(c_1, \mathit{jack}) \Rightarrow \mathrm{FACE}(c_1)$.

Instance 2. $\mathrm{RANK}(c_2, \mathit{queen}) \Rightarrow \mathrm{FACE}(c_2)$.

Instance 3. $\mathrm{RANK}(c_3, \mathit{king}) \Rightarrow \mathrm{FACE}(c_3)$.

The program can discover the rule by forming the disjunction of the
possibilities:

Rule 2. $\mathrm{RANK}(c_1, \mathit{jack}) \lor \mathrm{RANK}(c_1, \mathit{queen}) \lor \mathrm{RANK}(c_1, \mathit{king}) \Rightarrow \mathrm{FACE}(c_1)$.

Notice that this decision to add options is a less drastic generalization than
that of turning the jack, queen, and king constants into a single variable to
get

Rule 3 (wrong). $\mathrm{RANK}(c_1, y) \Rightarrow \mathrm{FACE}(c_1)$.

An alternative to ordinary disjunction is what Michalski (1980) terms an
internal disjunction. If we allow sets and set membership in our
representation, we can express our instances as

Instance 1'. $\mathrm{RANK}(c_1) \in \{\mathit{jack}\} \Rightarrow \mathrm{FACE}(c_1)$.

Instance 2'. $\mathrm{RANK}(c_2) \in \{\mathit{queen}\} \Rightarrow \mathrm{FACE}(c_2)$.

Instance 3'. $\mathrm{RANK}(c_3) \in \{\mathit{king}\} \Rightarrow \mathrm{FACE}(c_3)$.

The generalization can then be expressed as

Rule 2'. $\mathrm{RANK}(c_1) \in \{\mathit{jack}, \mathit{queen}, \mathit{king}\} \Rightarrow \mathrm{FACE}(c_1)$.

This latter representation is more compact.
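
With an internal disjunction, adding an option is just set union. The
following sketch assumes the rule is stored as the set of ranks it accepts;
the names `add_option` and `is_face` are illustrative.

```python
def add_option(allowed_ranks, new_rank):
    """Generalize an internal disjunction by admitting one more value."""
    return allowed_ranks | {new_rank}


face_ranks = frozenset()
for observed_rank in ["jack", "queen", "king"]:  # the three instances
    face_ranks = add_option(face_ranks, observed_rank)


def is_face(rank):
    """Rule 2': the rank must be one of the options collected so far."""
    return rank in face_ranks


print(sorted(face_ranks))            # ['jack', 'king', 'queen']
print(is_face("queen"), is_face("7"))  # True False
```

The over-general rule 3 would correspond to accepting every rank, so it would
wrongly classify the 7 as a face card.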

Similar rules of generalization can be defined for numerical
representations that use a linear combination of features, as follows:

4. Curve fitting. Suppose a program is attempting to discover how the
output, z, of a system is related to two inputs, x and y. The program is
provided with training instances in the form of (x, y, z) triples that show
the output of the system for particular values of the inputs:

Instance 1. (0, 2, 7).

Instance 2. (6, -1, 10).

Instance 3. (-1, -5, -16).

By a curve-fitting technique, such as least-squares regression, the program
fits the plane

Rule 1. $z = 2x + 3y + 1$,

or, alternately, the ordered triple (x, y, 2x + 3y + 1), to these data. This
generalizes the relationship, so that it holds for many more (x, y, z) triples
than just the three training instances. The program can now predict the z
output for any values of the x and y inputs. This process is analogous to
the turning-constants-into-variables generalization rule.
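
The fit can be checked with a short, self-contained least-squares sketch. It
solves the normal equations exactly with rational arithmetic from the
standard library; the helper names `solve` and `least_squares` are introduced
here for illustration and do not come from any system discussed in this
article.

```python
from fractions import Fraction


def solve(matrix, rhs):
    """Solve a square linear system by Gauss-Jordan elimination (exact)."""
    n = len(rhs)
    rows = [[Fraction(v) for v in row] + [Fraction(b)]
            for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [row[n] for row in rows]


def least_squares(points):
    """Fit z = a*x + b*y + c by solving the normal equations."""
    design = [(x, y, 1) for x, y, _z in points]
    targets = [z for _x, _y, z in points]
    ata = [[sum(r[i] * r[j] for r in design) for j in range(3)]
           for i in range(3)]
    atz = [sum(r[i] * t for r, t in zip(design, targets)) for i in range(3)]
    return solve(ata, atz)


a, b, c = least_squares([(0, 2, 7), (6, -1, 10), (-1, -5, -16)])
print(a, b, c)  # 2 3 1
```

With three instances and three unknown coefficients the fit is exact, so the
sketch recovers the coefficients 2, 3, and 1.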

5. Zeroing a coefficient. The program can further generalize this
relationship by zeroing the y coefficient and fitting a plane to the three
training instances. In this case, it obtains

Rule 2. $z = 2.59x - 3.99$.

Alternately, the ordered triple is (x, y, 2.59x - 3.99). (The y coordinate can
be anything.) By giving y the coefficient of zero, the program has dropped it
as a condition and reduced the dimensionality of the function z = f(x, y) to
make it z = g(x). The program has decided that y is irrelevant to the value
of z. The relationship now holds for an even larger set of (x, y, z) triples.

This rule is analogous to the dropping-condition rule of generalization.
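
The coefficients in rule 2 can be reproduced with an ordinary least-squares
fit of z against x alone. The function name `fit_ignoring_y` is an assumption
of this sketch.

```python
def fit_ignoring_y(points):
    """Least-squares fit of z = a*x + c, with the y coefficient fixed at zero."""
    n = len(points)
    mean_x = sum(x for x, _y, _z in points) / n
    mean_z = sum(z for _x, _y, z in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _y, _z in points)
    sxz = sum((x - mean_x) * (z - mean_z) for x, _y, z in points)
    slope = sxz / sxx
    return slope, mean_z - slope * mean_x


slope, intercept = fit_ignoring_y([(0, 2, 7), (6, -1, 10), (-1, -5, -16)])
print(round(slope, 2), round(intercept, 2))  # 2.59 -3.99
```

Unlike the three-coefficient fit, this one no longer passes through the
training instances exactly; it trades accuracy on them for a simpler, more
general relationship.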

Notice that these rules of inference correspond to particular features of
the representation language. For example, the method of turning constants
into variables makes use of free variables, the method of adding options uses
the disjunction operator, and the coefficient-zeroing technique makes use of
the multiplication operator. To the extent that the representation language
has fewer of these features, fewer inference rules will be applicable and,
consequently, the search of the rule space will be easier to accomplish. But
since each of these language features contributes to the expressiveness of the
representation, the designer of a learning system faces a trade-off between the
increased expressiveness of the representation and the increased difficulty of
searching the rule space.

The single-representation trick. Another factor relating to the
difficulty of searching the rule space (and the instance space) is the difference
between the representation used for rules and the representation used for
the training instances. If the representations for the rule space and the
instance space are far removed from each other, then the searches of the
two spaces must be coordinated by complex interpretation and
experiment-planning procedures. One trick commonly used to avoid this problem
is to choose the same representation for both spaces. Training instances are
viewed literally as highly specific pieces of acquired knowledge. Suppose, for
example, that we are trying to teach a program the concept of a pair in poker.
We want the program to learn the rule

Rule 4. $\exists\, card_1, card_2 : \mathrm{RANK}(card_1, x) \land \mathrm{RANK}(card_2, x) \Rightarrow \mathrm{PAIR}$.

(This  is  only  an  approximate  definition  of  PAIR.  An  exact  definition  would 
require  a  more  complex  representation  involving  equality.) 

As was shown above, specific hands could be represented "naturally" as
sets of five ordered pairs: the rank and suit of each of the cards. With such
a representation for the hand made up of the 2 of clubs, 3 of diamonds, 2 of
hearts, 6 of spades, and king of hearts, we would obtain

Instance 1. $\{(2, \mathrm{clubs}), (3, \mathrm{diamonds}), (2, \mathrm{hearts}), (6, \mathrm{spades}), (\mathrm{king}, \mathrm{hearts})\} \Rightarrow \mathrm{PAIR}$.

But this representation makes it difficult to discover the concept of a pair in
poker with the syntactic rules of inference described above. A less natural, but
more useful, representation would describe the hand in predicate calculus,
the same representation that we will eventually need for the acquired concept
(rule 4). Thus, we would say of our hand

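Instance 1′.
\[
\begin{aligned}
\exists\, c_1, c_2, c_3, c_4, c_5 :\ & \mathrm{RANK}(c_1, 2) \land \mathrm{SUIT}(c_1, \mathrm{clubs}) \;\land \\
& \mathrm{RANK}(c_2, 3) \land \mathrm{SUIT}(c_2, \mathrm{diamonds}) \;\land \\
& \mathrm{RANK}(c_3, 2) \land \mathrm{SUIT}(c_3, \mathrm{hearts}) \;\land \\
& \mathrm{RANK}(c_4, 6) \land \mathrm{SUIT}(c_4, \mathrm{spades}) \;\land \\
& \mathrm{RANK}(c_5, K) \land \mathrm{SUIT}(c_5, \mathrm{hearts}) \Rightarrow \mathrm{PAIR}.
\end{aligned}
\]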

Now the process of generalization merely involves dropping the SUIT conditions
and replacing the constant 2 by a variable x. Of course, there are many
other possible generalizations of instance 1′, and the search of the rule space
would still be nontrivial. The advantage of using the single-representation
trick is that we have chosen a representation that allows this search to be
accomplished by simple syntactic processes.
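The syntactic steps just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the encoding of literals as (predicate, card, value) tuples, the "?x" convention for variables, and the function names are invented for the example and are not taken from any program discussed in this chapter. The sketch also makes explicit that reaching rule 4 requires dropping the RANK conditions that do not mention the new variable.

```python
# Instance 1' encoded as a set of (predicate, card, value) literals.
# This encoding is a hypothetical choice made for the illustration.
instance = {
    ("RANK", "c1", 2), ("SUIT", "c1", "clubs"),
    ("RANK", "c2", 3), ("SUIT", "c2", "diamonds"),
    ("RANK", "c3", 2), ("SUIT", "c3", "hearts"),
    ("RANK", "c4", 6), ("SUIT", "c4", "spades"),
    ("RANK", "c5", "K"), ("SUIT", "c5", "hearts"),
}

def drop_conditions(literals, predicate):
    """Dropping-condition rule: remove every literal that uses `predicate`."""
    return {lit for lit in literals if lit[0] != predicate}

def constant_to_variable(literals, constant, variable):
    """Turning-constants-into-variables rule: replace `constant` by `variable`."""
    return {(p, c, variable if v == constant else v) for (p, c, v) in literals}

# Step 1: drop the SUIT conditions.
step1 = drop_conditions(instance, "SUIT")
# Step 2: replace the constant 2 by the variable ?x.
step2 = constant_to_variable(step1, 2, "?x")
# Step 3: drop the remaining RANK conditions that do not mention ?x.
rule = {lit for lit in step2 if lit[2] == "?x"}

print(sorted(rule))
```

The resulting pair of literals corresponds to the left-hand side of rule 4, up to the renaming of the cards.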

The  problems  of  interpretation  and  experiment  planning  are  eased  when 
the  single-representation  trick  is  used.  Many  learning  programs  sidestep  these 
problems  completely  by  assuming  that  the  training  instances  are  provided  by 
the  environment  in  the  same  representation  as  used  for  the  rule  space.  In 
more practical situations, the interpretation and experiment-planning routines
serve  to  translate  between  the  raw  instances  (as  they  are  received  from  the 
environment)  and  the  derived  instances  (after  they  have  been  interpreted  as 
specific  points  in  the  rule  space). 

Methods of searching the rule space. Now that we have discussed
the issue of how to represent the rule space, we can turn our attention to the
four main methods that have been used to search the rule space. All of these
methods maintain a set, H, of the currently most plausible rules. They differ
primarily in how they refine the set H so that it eventually includes the desired
points in the rule space. A useful classification of search methods distinguishes
methods in which the presentation of the training instances drives the search
(so-called data-driven methods) from those methods in which an a priori model
guides the search (so-called model-driven methods).

The first data-driven method is the version-space method (and several
related techniques). This approach uses the single-representation trick to
represent training instances as very specific points in the rule space. The
set H is initialized to contain all hypotheses consistent with the first positive
training instance. New training instances are examined one at a time and
pattern-matched against H to determine whether the hypotheses in H should
be generalized or specialized.
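The bookkeeping behind this idea can be illustrated with a deliberately naive sketch that stores H explicitly and simply discards inconsistent hypotheses (a list-then-eliminate simplification, not the generalize/specialize procedure itself). The hypothesis language (one value or a wildcard per feature), the feature names, and the function names are assumptions made for the example.

```python
from itertools import product

WILDCARD = "?"

def matches(hypothesis, instance):
    """A hypothesis matches when every non-wildcard slot equals the feature."""
    return all(h == WILDCARD or h == v for h, v in zip(hypothesis, instance))

def initial_hypotheses(first_positive):
    """All conjunctive hypotheses consistent with the first positive instance:
    each slot holds either the observed value or a wildcard."""
    choices = [(value, WILDCARD) for value in first_positive]
    return set(product(*choices))

def refine(H, instance, is_positive):
    """Keep only hypotheses that classify the new instance correctly."""
    return {h for h in H if matches(h, instance) == is_positive}

# Features are (size, color, shape); the data are invented for the example.
H = initial_hypotheses(("big", "red", "circle"))     # 8 hypotheses
H = refine(H, ("small", "red", "circle"), True)      # size must be a wildcard
H = refine(H, ("big", "blue", "circle"), False)      # color must stay "red"

print(sorted(H))
```

No backtracking is needed: each instance only shrinks H. Storing H explicitly grows exponentially with the number of features, which is why a practical program would need a more compact representation of the set.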

The second method, also a data-driven method, does not use the
single-representation trick. Instead, special procedures (or production rules)
examine the set of training instances and decide how to refine the current
set, H, of hypotheses. The program can be viewed as having a set of
hypothesis-refinement operators. In each cycle, it uses the data to choose one
of these operators and then applies it. Lenat's (1976) AM system is an example
of this approach.

The third approach is model-driven generate and test. This method
repeatedly generates and tests hypotheses from the rule space against the
training instances. Model-based knowledge is used to constrain the hypothesis
generator to generate only plausible hypotheses. The Meta-DENDRAL program
is the best example of this approach (see Buchanan and Mitchell, 1978).

Finally, the fourth approach is model-driven schema instantiation. It uses
a set of rule schemas to provide general constraints on the form of plausible
rules. The method attempts to instantiate these schemas from the current
set of training instances. The instantiated schema that best fits the training
instances is considered the most plausible rule. Dietterich's SPARC program
(Dietterich, 1979; Dietterich and Michalski, in press), which discovers secret
rules in the card game Eleusis, applies the schema-instantiation method.

Data-driven techniques generally have the advantage of supporting
incremental learning. A feature of the version-space method, in particular, is
that the H set can easily be modified to account for new training instances
without any backtracking by the learning program. In contrast, model-driven
methods, which test and reject hypotheses based on an examination of the
whole body of data, are difficult to use in incremental learning situations.
When new training instances become available, model-driven methods must
either backtrack or search the rule space again, because the criteria by which
hypotheses were originally tested (or schemas instantiated) have changed.

A strength of model-driven methods, on the other hand, is that they
tend to have good noise immunity. When a set of hypotheses, H, is tested
against noisy training instances, the model-driven methods need not reject a
hypothesis on the basis of one or two counterexamples. Since the whole set of
training instances is available, the program can use statistical measures of how
well a proposed hypothesis accounts for the data. In data-driven methods, H is
revised each time on the basis of the current training instance. Consequently,
a single erroneous instance can cause a large perturbation in H (from which
it may never recover). One approach that allows data-driven methods to
handle noise is to make very slight, conservative changes in H in response to
each training instance. This minimizes the effect of any erroneous training
instances, but it causes the learning system to learn much more slowly.
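The noise tolerance of a generate-and-test search can be illustrated with a small sketch. It is a toy under stated assumptions, not a description of Meta-DENDRAL or any other program: the hypothesis language (one value or a wildcard per feature) stands in for the model that constrains the generator, the score is simply the fraction of instances classified correctly, and the data set is invented, with one instance deliberately mislabeled.

```python
from itertools import product

WILDCARD = "?"

def matches(hypothesis, instance):
    """A conjunctive hypothesis matches when every non-wildcard slot agrees."""
    return all(h == WILDCARD or h == v for h, v in zip(hypothesis, instance))

def generate(values_per_feature):
    """Hypothesis generator: a value or a wildcard for each feature."""
    slots = [tuple(values) + (WILDCARD,) for values in values_per_feature]
    return list(product(*slots))

def score(hypothesis, data):
    """Fraction of training instances whose label the hypothesis predicts."""
    correct = sum(matches(hypothesis, x) == label for x, label in data)
    return correct / len(data)

# Intended concept: "color is red". Features are (color, shape).
data = [
    (("red", "circle"), True),
    (("red", "square"), True),
    (("blue", "circle"), False),
    (("blue", "square"), False),
    (("red", "circle"), False),    # deliberately mislabeled (noise)
    (("red", "triangle"), True),
]

hypotheses = generate([("red", "blue"), ("circle", "square", "triangle")])
best = max(hypotheses, key=lambda h: score(h, data))
print(best, score(best, data))
```

Because every hypothesis is scored against the whole data set, the single mislabeled instance lowers the score of the intended hypothesis without eliminating it. A strict elimination scheme applied to the same data would discard it.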

The  problem  of  new  terms.  In  some  learning  problems,  the  program 
can  assume  that  the  desired  rule  or  rules  exist  somewhere  in  the  rule  space. 
Consequently, the search has a well-defined goal. In many situations, however,
there  is  no  such  guarantee,  and  the  learning  program  must  confront  the 
possibility  that  its  representation  of  the  rule  space  is  inadequate  and  should 
be  expanded.  This  is  called  the  problem  of  new  terms. 

One approach to expanding the rule space is to add new terms to the
representation. Consider again the problem of teaching a program the concept
of a pair in poker. In the section above, the program was able to represent the
pair concept by using a predicate-calculus representation with the suit and
rank terms. Such a representation would not permit the program to discover
the concept of a straight, however. One way to represent the straight concept
would be to create a new term called SUCC(x, y), which is true if and only if
x = y + 1. Now the straight concept can be represented as:

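\[
\begin{aligned}
& \mathrm{RANK}(c_1, r_1) \land \mathrm{RANK}(c_2, r_2) \land \mathrm{RANK}(c_3, r_3) \land \mathrm{RANK}(c_4, r_4) \land \mathrm{RANK}(c_5, r_5) \;\land \\
& \mathrm{SUCC}(r_1, r_2) \land \mathrm{SUCC}(r_2, r_3) \land \mathrm{SUCC}(r_3, r_4) \land \mathrm{SUCC}(r_4, r_5).
\end{aligned}
\]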
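A minimal sketch of how the new term makes the straight concept testable follows. It assumes the reading of SUCC given above (the first argument is one greater than the second), assumes that ranks are encoded as integers (for example, jack = 11), and handles the existential quantification over cards by sorting the ranks so that a satisfying assignment, if one exists, is the descending order. The function names are hypothetical.

```python
def succ(x, y):
    """SUCC(x, y): true if and only if x = y + 1 (assumed reading)."""
    return x == y + 1

def is_straight(ranks):
    """True if the five ranks can be bound to r1..r5 so that
    SUCC(r1, r2), SUCC(r2, r3), SUCC(r3, r4), and SUCC(r4, r5) all hold."""
    if len(ranks) != 5:
        return False
    r = sorted(ranks, reverse=True)   # the only ordering that can satisfy SUCC
    return all(succ(r[i], r[i + 1]) for i in range(4))

print(is_straight([3, 5, 4, 2, 6]))    # ranks 6, 5, 4, 3, 2 form a chain
print(is_straight([2, 3, 2, 6, 13]))   # the pair hand from instance 1
```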

The problem of defining new terms is quite difficult to solve. An advantage
of the hypothesis-refinement operator approach to searching the rule space is
that it is fairly easy to incorporate operators that create new terms. The
BACON (Langley, 1980) and AM programs both have operators that create
new terms by combining and refining existing terms.


Experiment  Planning 

Once  the  learning  element  has  searched  the  rule  space  and  developed 
a  set,  H,  of  plausible  hypotheses,  the  program  may  need  to  gather  more 
training  instances  to  test  and  refine  them.  When  the  instance  space  and  the 
rule  space  are  represented  in  very  different  ways,  the  process  of  determining 
which training instances are needed and how they can be obtained can be
quite involved. Suppose, for example, that a genetics learning program is
attempting to discover which portions of DNA are important. To test a
high-level hypothesis (or several hypotheses), it may be necessary to plan a very
involved  experiment  to  synthesize  a  particular  strand  of  DNA  and  insert  it 
into  the  appropriate  bacterial  cells  to  observe  the  resulting  behavior  of  the 
cells. 

The AM program is an example of an AI learning program that performs
some experiment planning. After one of AM's refinement operators creates
a new concept, AM must gather examples of that concept to evaluate and
refine it. Several techniques are used to generate good training instances,
for example, by symbolically instantiating the concept definition or by
inheriting examples from more general or more specific concepts. AM has a
special body of heuristics for locating positive and negative boundary examples
(i.e., examples that barely succeed, or barely fail, to be instances of the
concept).

Taxonomy of Work in Learning from Examples

Now  that  we  have  described  the  two-space  model,  we  present  a  rough 
taxonomy  of  work  in  the  area  of  learning  from  examples.  Several  subareas 
of  research  have  developed  within  this  area,  ranging  from  philosophically 
oriented inductive learning to highly engineering-oriented pattern-classification
work. These different areas can be characterized by two components of the
simple  learning  model  presented  in  Article  XIV.A:  the  representation  used  in 
the  knowledge  base  and  the  task  that  the  performance  element  carries  out. 
In  the  remainder  of  this  chapter,  a  separate  article  is  devoted  to  each  of  these 
subareas. 

Systems that use numerical representations. Researchers in electrical
engineering and systems theory have developed learning methods that
represent acquired knowledge in the form of polynomials and matrices. The
performance elements of these learning systems, which are usually called
adaptive systems, typically perform tasks such as pattern classification,
adaptive control, and adaptive filtering. The strengths of these adaptive
methods are that they can be used in noisy environments, in environments whose
properties are changing rapidly, and in situations where analytic solutions
based on classical systems theory are unavailable. We include an article on
this subject because of its historical relationship to AI and because of the
possibility that useful hybrid systems may be constructed in the future.

Systems that use symbolic representations. Most AI work on learning
has used symbolic representations such as feature vectors, first-order
predicate calculus, and production rules to represent the knowledge acquired by
the learning element. It is useful to classify this work according to the
complexity of the task being performed by the learning system:

1. Learning single concepts. The simplest performance task is to classify new
instances according to whether they are instances of a single concept.
The problem of learning single concepts has received a lot of attention
and is probably the best understood learning task in AI.

2. Learning multiple concepts. Many performance tasks involve the use of
a set of concepts that operate independently. Disease diagnosis, for
example, is a task in which the program seeks to assign one or more
disease classes to a patient. The problem of learning a set of concepts
has received some attention in AI. The Meta-DENDRAL and AM systems,
for example, discover many concepts in order to describe their training
instances and guide the performance element.

3. Learning to perform multiple-step tasks. The most complex performance
tasks for which learning techniques have been developed are relatively
simple planning tasks that require the performance element to apply
a sequence of operators to perform the task. Unlike the multiple, but
independent, concepts used in Meta-DENDRAL and AM, the rules in
these systems must be chained together into a sequence. Consequently,
many difficult problems of integration and credit-assignment arise.

References 

Simon and Lea (1974) describe the two-space model of rule induction.
Dietterich and Michalski (1981) provide some perspectives on systems that
learn from examples. See also Buchanan, Mitchell, Smith, and Johnson (1977).


D2.  Learning  in  Control  and  Pattern  Recognition  Systems 


THERE  ARE  many  applications  in  engineering  and  science  for  which  learning 
systems  have  been  developed.  These  systems,  usually  called  adaptive  systems, 
are  useful  when  classical  systems  techniques  cannot  be  applied  because  of 
insufficient  knowledge  about  the  underlying  system.  Such  situations  often 
arise  in  extremely  noisy  and  rapidly  changing  environments. 

Classical  systems  theory  addresses  itself  to  problems  in  the  design  and 
analysis  of  systems,  where  a  system  is  viewed  abstractly  as  an  operator  that 
maps a vector of inputs, x, to a vector of outputs, y. Two important engineering
problems for which learning systems have been developed are control and
pattern  recognition. 

Consider the control problem shown in Figure D2-1. The system is an
automobile engine. The inputs (in this case, control inputs) are the amount
of gasoline and the setting of the spark-plug advance. The single output is
the speed of the engine. The control problem is to determine the settings
of the inputs over time, so that the output follows a particular curve. We
want the speed of the engine to track the desired speed as commanded by the
driver of the automobile. If we have a mathematical model of the engine (say,
as a set of differential equations relating $x_1$ and $x_2$ to $y$), we can
often solve this control problem. To obtain the model, we can usually inspect
the system directly and apply the laws of physics. But in complex, time-varying
systems, such an approach may be impossible. Instead, it may be necessary to
identify the system, that is, construct a model by observing the system in
operation and finding an empirical relationship between the inputs and the
outputs.

Pattern recognition, the other task for which adaptive learning is useful,
also can be viewed as a system-identification problem. The
pattern-classification system shown in Figure D2-2 takes an input object
(represented as a vector, x, of features) and maps it into one of m pattern
classes. The archetypal pattern-classification problem is optical character
recognition, in which the inputs are images of handwritten or printed
characters and the output is a classification of each image as one of the
letters, numerals, or punctuation symbols. Suppose we want to build a computer
system that can recognize characters. We have available an unknown system (in
this case, a person) that can perform the task reliably. If we can identify the
system, we will then have a computer model that can recognize handwritten
characters.

Figure D2-1. A simple control problem.

Figure D2-2. A simple pattern-classification problem.

Figure D2-3 illustrates the general setup for adaptive system identification.
The unknown system and the model are configured in parallel. Their
outputs (the true output, $y$, and the estimated output, $\hat{y}$) are compared,
and the error, e, is fed back to the learning element, which then modifies the
model appropriately. In the terminology of our simple learning-system model,
the unknown system is the environment. It provides training instances, in the
form of (x, y) pairs, to the learning element. The learning element modifies
certain parts of the model (i.e., the knowledge base), so that the model system
(i.e., the performance element) more accurately models the unknown system.

Conceptually, therefore, adaptive system identification, adaptive control,
and pattern recognition are all problems of learning from examples. The
unknown system provides the training instances and the performance standard
(i.e., the true y values).

Figure D2-3. Adaptive system identification.

In  this  article,  we  discuss  the  methods  that  have  been  used  to  accomplish 
this  learning.  We  have  divided  the  methods  into  four  groups  according  to  the 
representations  that  are  used  to  model  the  unknown  system: 

1. Statistical algorithms, which employ probability density functions to create
a Bayesian decision procedure;

2. Parameter learning, which uses a vector of parameters and a linear model;

3. Automata learning, which uses stochastic and fuzzy automata (discussed
below) to model the unknown system; and

4.  Structural  learning,  which  uses  pattern  grammars  and  graphs  to  represent 
classes  of  objects  for  pattern  classification. 

Statistical Learning Algorithms

In  pattern  recognition  (and  sometimes  in  control),  it  is  possible  to  view 
the  unknown  system  as  making  a  decision  to  assign  the  input,  x,  to  one 
class, y, out of m classes. By defining a loss function that penalizes incorrect
decisions (i.e., decisions in which $\hat{y}$ differs from $y$), a minimum-average-loss
Bayes  classifier  can  be  used  to  model  the  unknown  system.  The  problem  of 
identifying  the  unknown  system  then  reduces  to  the  problem  of  estimating  a 
set  of  parameters  for  certain  probability  density  functions.  These  parameters, 
such  as  the  mean  vector  and  the  variance-covariance  matrix,  can  be  estimated 
from  the  training  instances  in  a  fairly  straightforward  fashion  (see  Duda  and 
Hart,  1973). 
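As a hedged illustration of this estimation step, the sketch below fits a one-dimensional normal density to each class and classifies by the largest posterior, which is the minimum-average-loss decision when every error costs the same (a zero-one loss). The restriction to a single scalar feature, the function names, and the sample data are assumptions made to keep the example short; the multivariate case replaces the variance with a covariance matrix.

```python
import math
from collections import defaultdict

def fit(samples):
    """Estimate (prior, mean, variance) for each class from (x, label) pairs."""
    by_class = defaultdict(list)
    for x, label in samples:
        by_class[label].append(x)
    n = len(samples)
    params = {}
    for label, xs in by_class.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)   # ML estimate
        params[label] = (len(xs) / n, mean, var)
    return params

def log_posterior(x, prior, mean, var):
    """Log of prior times normal density (up to a term common to all classes)."""
    return (math.log(prior)
            - 0.5 * math.log(2 * math.pi * var)
            - (x - mean) ** 2 / (2 * var))

def classify(x, params):
    """Bayes decision under zero-one loss: pick the most probable class."""
    return max(params, key=lambda label: log_posterior(x, *params[label]))

samples = [(1.0, "a"), (2.0, "a"), (3.0, "a"), (7.0, "b"), (8.0, "b"), (9.0, "b")]
params = fit(samples)
print(classify(2.5, params), classify(7.5, params))
```

The rule space here is the set of (prior, mean, variance) triples, and it is "searched" by direct calculation rather than by any iterative procedure.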

In the terminology of Simon and Lea (1974), the set of all possible x vectors
forms the instance space, and the set of possible values for the parameters
of  the  probability  distributions  forms  the  rule  space.  The  rule  space  is  searched 
by  direct  calculation  from  the  training  instances.  The  instance  space  is  not 
actively  searched. 

Unfortunately, these methods rely on assuming a particular form (e.g.,
multivariate  normal)  for  the  probability  distributions  in  the  model.  These 
assumptions  frequently  do  not  hold  in  real-world  problems.  Furthermore,  the 
computational  costs  of  the  estimation  may  be  very  high  when  there  are  many 
features. 

Parameter  Learning 

In parameter learning, a fixed functional form is assumed for the unknown
system. This functional form has a vector of parameters, w, that must be
determined from the training instances. Unlike the statistical methods, there
is little or no probabilistic interpretation for the unknown parameters and,
consequently, probability theory provides no guidance for estimating them
from the data. Instead, some sort of criterion, usually the squared error
$(y - \hat{y})^2$ averaged over all training instances, is minimized. The rule
space is thus a space of possible parameter vectors, and it is searched by
hill-climbing (also called gradient descent) to find the point that minimizes
the error between the model and the unknown system.

The  most  popular  form  assumed  for  the  unknown  system  is  a  linear 
functional: 

\[
\hat{y} = \mathbf{w} \cdot \mathbf{x} = \sum_i w_i x_i .
\]

The  output  is  assumed  to  be  a  linear  combination  of  the  input  feature  vector, 
x,  with  a  weight  vector,  w.  The  elements  of  the  weight  vector  are  the  unknown 
parameters.  The  rule  space  is  thus  the  space  of  all  possible  weight  vectors, 
known as the weight space.

An important special case arises when the unknown system is a binary
pattern classification system similar to the system shown earlier in Figure
D2-2. In binary pattern classification, the classifier must indicate in which
of the two pattern classes the input pattern, x, belongs. This is typically
accomplished by taking the output, $\hat{y}$, of a linear functional and comparing
it to a threshold, $\theta$:

If $\hat{y} > \theta$, then x is in class 1.

If $\hat{y} < \theta$, then x is in class 2.

Usually, the instance space is normalized, so that the threshold $\theta$ is zero.
This linear-discriminant function can be thought of as a hyperplane that splits
the instance space into two regions (class 1 and class 2). For example, if
$\mathbf{x} = (x_1, x_2)$ is a two-dimensional feature vector and
$\mathbf{w} = (-1, 2)$, the instance space is split as shown in Figure D2-4.
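The two-dimensional example can be checked numerically with a short sketch. The function name, the test points, and the choice to assign points lying exactly on the hyperplane to class 2 are assumptions made for the illustration.

```python
def classify(w, x, theta=0.0):
    """Linear discriminant: class 1 if w . x exceeds the threshold, else class 2.
    Points exactly on the hyperplane go to class 2 (an arbitrary choice)."""
    y_hat = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if y_hat > theta else 2

w = (-1, 2)                     # weight vector from the example
print(classify(w, (1, 2)))      # w . x = -1 + 4 =  3, so class 1
print(classify(w, (4, 1)))      # w . x = -4 + 2 = -2, so class 2
```

With this weight vector and a zero threshold, the separating hyperplane is the line $x_2 = x_1 / 2$; points above it fall in class 1.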

The learning problem of finding w can thus be viewed as the problem
of finding a hyperplane that separates training instances of class 1 from
training instances in class 2. When it is possible to find such a hyperplane,
the training instances are said to be linearly separable. Often, however, the
training instances are not linearly separable. In such cases, we must either use
a more complex functional form, such as a quadratic function, or else settle
for the hyperplane that makes the fewest errors on the average.

How can the desired hyperplane, or, equivalently, the desired weight
vector, be found? We describe three basic algorithms for computing the weight
vector. The first two algorithms are hill-climbing methods that process the
training instances one at a time. After each training instance, $\mathbf{x}_k$,
the weight vector, $\mathbf{w}_k$, is updated to give $\mathbf{w}_{k+1}$.

Figure D2-4. An example of a linear-discriminant function
(+: instance of class 1; -: instance of class 2).

The first algorithm, called the fixed-increment perceptron algorithm, seeks
to minimize the classification errors made by the model. If $\mathbf{x}_k$ is an
instance of class 1 and $\hat{y} = \mathbf{w}_k \cdot \mathbf{x}_k$ is less than 0,
instead of greater than 0, an error has been made. The magnitude of this error
is $e = 0 - \mathbf{w}_k \cdot \mathbf{x}_k$, that is, the difference between the
desired value for the output of the system ($y = 0$) and the value computed by
the model ($\hat{y} = \mathbf{w}_k \cdot \mathbf{x}_k$). This is usually written
as the perceptron criterion,

\[
J_p = -\mathbf{w}_k \cdot \mathbf{x}_k ,
\]

and the goal of learning is to minimize $J_p$. The fixed-increment algorithm
updates $\mathbf{w}_k$ whenever $J_p > 0$ according to

\[
\mathbf{w}_{k+1} = \mathbf{w}_k + \mathbf{x}_k . \tag{1}
\]

We can think of $J_p$ as a surface over the weight space, the space of possible
values for the weight vector w (see Fig. D2-5). Mathematical analysis shows
that $\mathbf{x}_k$ can be viewed as a vector in this weight space (as well as in
instance space) pointing in the direction of steepest descent for $J_p$. Thus,
this algorithm takes a fixed-size step in the direction of steepest descent.

Similarly, if $\mathbf{x}_k$ is in class 2 and $\mathbf{w}_k \cdot \mathbf{x}_k > 0$,
an error has been made. The solution is to adjust w as

\[
\mathbf{w}_{k+1} = \mathbf{w}_k - \mathbf{x}_k .
\]

Equivalently, all training instances in class 2 can be replaced by their
negatives, and all instances can be processed as though they were in class 1.
Equation  (1)  can  then  be  used  to  perform  the  entire  learning  process. 
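The procedure can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the zero initial weight vector, the epoch limit, and the small data set are assumptions. One deliberate deviation from the condition stated above is that an instance lying exactly on the hyperplane ($\mathbf{w} \cdot \mathbf{x} = 0$) is also treated as an error, because otherwise the all-zero starting vector would never be updated.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_perceptron(instances, max_epochs=100):
    """Fixed-increment perceptron. `instances` is a list of (x, cls) pairs
    with cls in {1, 2}. Returns a weight vector that separates the classes."""
    # Replace class-2 instances by their negatives so that every instance
    # should satisfy w . x > 0, and only equation (1) is needed.
    data = [tuple(x) if cls == 1 else tuple(-xi for xi in x)
            for x, cls in instances]
    w = [0.0] * len(data[0])
    for _ in range(max_epochs):
        errors = 0
        for x in data:
            if dot(w, x) <= 0:                       # misclassified (or on boundary)
                w = [wi + xi for wi, xi in zip(w, x)]   # equation (1)
                errors += 1
        if errors == 0:                              # a full pass with no errors
            return w
    raise ValueError("no separating hyperplane found within max_epochs")

instances = [((1, 2), 1), ((0, 1), 1), ((2, 0), 2), ((1, -1), 2)]
w = train_perceptron(instances)
print(w)
```

On this data the procedure makes two corrections in the first pass and none in the second, ending at the weight vector (-1, 2) used in the earlier example. The epoch limit matters in practice, since the loop would not terminate on data that are not linearly separable.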

The  fixed-increment  algorithm  converges  in  a  finite  number  of  steps  if  the 
training  instances  are  linearly  separable.  It  has  been  shown  for  the  two-class 
case  that  the  number  of  training  instances  should  be  at  least  twice  the  number 
of features in the instance space (see Nilsson, 1965).

Figure D2-5. A schematic diagram of the perceptron algorithm (a surface
over the weight space).

Historically, the fixed-increment algorithm is associated with Rosenblatt's
(1957, 1962) perceptron, which was developed within the study of bionics and
neural mechanisms. The simplest perceptron, shown in Figure D2-6, is a
device that assigns patterns to one of two classes. It consists of an array
of sensory units connected in a random way to an array of unmodifiable
threshold units, each of which computes some desired feature of the sensory
array and produces a +1 or -1 output, depending on whether the feature
is present or absent. The outputs of these feature-extraction units are then
connected to a modifiable unit that weights each input and sums the result
(i.e., computes $\mathbf{w} \cdot \mathbf{x}$). The resulting value is compared
with a threshold, and the perceptron produces an output of +1 if
$\mathbf{w} \cdot \mathbf{x}$ is greater than the threshold and -1 otherwise.
Thus, the simplest perceptron implements a linear-discriminant function. The
original publication of the perceptron model sparked a large amount of
research, and a fair amount of speculation, concerning the potential for
building intelligent machines from perceptrons. Minsky and Papert (1969)
attempted to quiet this speculation by proving several theorems about the
limits of perceptron-based learning. The introduction to their book provides
several criticisms of AI learning research that remain valid today.

Figure D2-6. The simplest form of perceptron.

The fixed-increment perceptron algorithm can be improved in several ways
by choosing how far in the direction of the gradient to go at each step. The
LMS (least-mean-square) algorithm (Widrow and Hoff, 1960), for example,
updates w according to

\[
\mathbf{w}_{k+1} = \mathbf{w}_k + \rho\, e_k\, \mathbf{x}_k ,
\]

where $\rho$ is a positive value and $e_k$ is the magnitude of the error, that is,
$-\mathbf{w}_k \cdot \mathbf{x}_k$. This algorithm tends to minimize the
mean-squared error

\[
J_s = \sum_k (\mathbf{w}_k \cdot \mathbf{x}_k)^2
\]

even  when  the  classes  are  not  linearly  separable.  The  algorithm  is  also  very 
easy  to  implement. 
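A minimal sketch of the update follows. It uses the general form of the error, $e_k = y_k - \mathbf{w}_k \cdot \mathbf{x}_k$, where $y_k$ is the desired output for instance k; the expression given in the text is the special case in which the desired output is 0. The step size, the number of passes, the function names, and the data (generated from the exact relation $y = 2x_1 - x_2$) are assumptions made for the illustration.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def lms(instances, rho=0.1, epochs=200):
    """LMS rule: after each instance, move w along x in proportion to the
    signed error e = y - w . x. `instances` is a list of (x, y) pairs."""
    w = [0.0] * len(instances[0][0])
    for _ in range(epochs):
        for x, y in instances:
            e = y - dot(w, x)                            # signed error
            w = [wi + rho * e * xi for wi, xi in zip(w, x)]
    return w

# Desired outputs generated from y = 2*x1 - 1*x2 (hypothetical data).
instances = [((1, 0), 2.0), ((0, 1), -1.0), ((1, 1), 1.0)]
w = lms(instances)
print(w)    # close to [2.0, -1.0]
```

Unlike the fixed-increment rule, the size of each correction shrinks as the error shrinks, so the weights settle near a least-squares solution instead of oscillating when the data cannot be fit exactly. The step size must be small relative to the squared length of the instances for the iteration to be stable.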

More robust, but harder to compute, algorithms are based on traditional
linear-regression and linear-programming techniques (see Duda and
Hart, 1973). Given a set of training instances, linear regression can be used
to minimize $J_s$. The weight vector is computed from the data as

\[
\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y} ,
\]

where y is the true output of the unknown system and X is a matrix of training
instances, one instance in each row. Unfortunately, this method requires
computing the pseudo-inverse $(X^T X)^{-1} X^T$ of X, which is an expensive step.
Less costly recursive algorithms have been developed that can compute w
incrementally as the training instances become available, rather than
collecting all of the instances and computing w once and for all (Goodwin and
Payne, 1977).
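For a small problem the batch computation can be sketched directly. The sketch below avoids forming the inverse explicitly and instead solves the equivalent system $(X^T X)\,\mathbf{w} = X^T \mathbf{y}$ by Gauss-Jordan elimination; the function names and the data are assumptions, and a numerical library would normally be used in place of the hand-written solver.

```python
def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def least_squares(X, y):
    """Weight vector minimizing the squared error: solves (X^T X) w = X^T y."""
    n = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(n)]
           for i in range(n)]
    Xty = [sum(row[i] * yk for row, yk in zip(X, y)) for i in range(n)]
    return solve(XtX, Xty)

X = [[1, 0], [0, 1], [1, 1]]     # one training instance per row
y = [2.0, -1.0, 1.0]             # true outputs (hypothetical data)
w = least_squares(X, y)
print(w)    # close to [2.0, -1.0]
```

All instances must be available before the computation starts, which is the cost that the recursive algorithms mentioned above are designed to avoid.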

Linear-programming  techniques  can  be  used  to  minimize  the  perceptron 
criterion,  Jp.  These  methods  also  conduct  a  hill-climbing  search  of  the  weight 
space.  Further  details  are  available  in  Duda  and  Hart  (1973). 

Some  of  these  linear-discriminant  algorithms  can  be  modified  slightly  to 
put  them  on  sound  statistical  foundations.  The  regression  techniques,  for 
example, can be adjusted to converge in the limit to an optimum Bayes
classifier. Their rate of convergence is slower than that of the unmodified
algorithms.
Consequently,  the  simpler,  faster  algorithms  shown  above  are  often  chosen  in 
favor  of  the  statistically  more  rigorous  methods. 

All  of  these  methods  for  finding  discriminant  functions  can  be  general¬ 
ized  to  handle  classification  problems  for  more  than  two  classes.  Typically, 
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a  separate  discrimination  function  is  learned  for  each  of  m  classes,  and  x  is 
classified  to  that  class  »  for  which  the  value  of  the  discriminant  function  /i(x) 
is  largest.  Another  approach  to  multiple-class  problems  is  to  perform  a  multi¬ 
stage  classification  in  which  x  is  first  classified  into  one  of  a  few  classes  and 
then  each  of  these  is  in  turn  split  into  subclasses  until  x  is  properly  classified. 
By  decomposing  the  classification  problem  into  subproblems,  other  a  priori 
knowledge  about  different  classes — and  the  features  relevant  to  those  classes — 
can  be  incorporated  into  the  system.  Moat  large,  multicategory  problems  do 
not  lend  themselves  to  straightforward  general  solutions.  Instead,  the  struc¬ 
ture  and  organisation  of  the  classification  strategy  are  usually  highly  depen¬ 
dent  on  the  particular  problem  and  domain-specific  knowledge.  Consequently, 
many  of  these  classification  problems  overlap  problems  in  AI. 
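
The first multiclass scheme described above, one discriminant function per class with the largest value deciding the class, can be sketched as follows. This is only an illustration: it assumes linear discriminants of the form f_i(x) = w_i . x + b_i, and the class names, weights, and the helper names `discriminant` and `classify` are invented for the example rather than taken from any system cited here.

```python
from typing import Dict, List, Tuple

# Hypothetical per-class linear discriminants: class -> (weight vector, bias).
# The numbers are invented purely for illustration.
DISCRIMINANTS: Dict[str, Tuple[List[float], float]] = {
    "a": ([1.0, 0.0], 0.0),
    "b": ([0.0, 1.0], 0.0),
    "c": ([-1.0, -1.0], 0.5),
}


def discriminant(weights: List[float], bias: float, x: List[float]) -> float:
    """Value of one linear discriminant function at the point x."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(x: List[float]) -> str:
    """Assign x to the class whose discriminant function is largest at x."""
    return max(DISCRIMINANTS, key=lambda c: discriminant(*DISCRIMINANTS[c], x))
```

In a real system the weight vectors would be produced by one of the training procedures discussed earlier; here they are fixed by hand so that only the decision rule is shown.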


Learning  Automata 

An  alternate  representation  for  an  unknown  system  is  as  a  finite-state 
automaton  (Fu,  1970b).  The  goal  is  to  find  a  finite-state  automaton  whose 
behavior  imitates  that  of  the  unknown  system.  Two  quite  similar  approaches 
have  been  pursued.  One  models  the  unknown  system  as  a  deterministic  finite- 
state  machine  with  randomly  perturbed  inputs.  The  learning  program  is 
given an initial state transition probability matrix, M, which tells overall for
each state, q_i, what the probability is that the next state will be q_j. From
M,  an  equivalent  deterministic  machine  can  be  derived,  and  the  probability 
distribution  of  the  input  symbols  can  be  determined.  This  approach  requires 
that  the  internal  states  of  the  unknown  system  can  be  precisely  observed  and 
measured. 

A  second  approach  models  the  unknown  system  as  a  stochastic  machine 
with a random transition matrix for each possible input symbol. Reinforcement
techniques are applied to adjust the transition probabilities. Unfortunately,
this requires a large amount of training information in order to exercise
all  possible  transitions.  As  with  the  first  approach,  assumptions  about  the 
observability  of  all  internal  states  must  be  made. 

Fuzzy automata based on Zadeh's fuzzy set concept provide an alternate,
but similar, approach to that used with stochastic automata (Wee and Fu,
1969). Set-membership criteria are applied, rather than probabilistic constraints,
in the selection of transitions and outputs. Fuzzy automata are also
able to make higher order transitions than stochastic automata and, consequently,
they can usually learn faster.

The  basic  ideas  of  automata  learning  have  been  extended  to  take  into 
account the interactions of a number of automata operating in the same environment.
Such automata may interact in either cooperative or competitive
modes. This has led to the formulation and study of automata games (Fu,

Automata methods have the advantage over parameter-learning methods
in that they do not require that there be a performance criterion with a unique
minimum point. Furthermore, automata provide a more expressive representation
for describing the unknown system. The principal disadvantage
of automata learning methods is that they are relatively slow compared to
parameter-learning techniques. In addition, they are usually suitable only for
application in stationary (i.e., non-time-varying) environments. Consequently,
automata  methods  have  not  yet  seen  much  practical  application. 

Structural  Learning 

Structural  learning  techniques  have  been  used  primarily  in  situations  in 
which the objects to be classified have important substructure (Fu, 1974). The
parametric  linear-discriminant  approaches  described  above  can  represent  only 
the  global  features  of  objects.  By  employing  pattern  graphs  and  grammars, 
important  substructures,  such  as  the  pen  strokes  that  make  up  a  character 
and  the  phonemes  that  make  up  a  spoken  word,  can  be  represented  along  with 
their  interrelationships.  A  first  step  in  setting  up  a  structural  learning  scheme 
involves  identifying  a  set  of  primitive  structural  elements  associated  with  the 
problem.  These  primitives  may  be  thought  of  as  the  alphabet  for  describing 
all  possible  patterns  associated  with  the  application.  They  need  to  be  higher 
level  objects  than  simple  scalar  measurements  (e.g.,  characters,  shapes,  and 
phonemes instead of height, width, and curvature). Legal and recognizable
patterns are formed from combinations of the primitives according to certain
syntactic  rules. 

Formal  language  theory  provides  a  theoretical  framework  that  accom¬ 
modates  the  structural  or  descriptive  formulation  of  pattern  recognition.  Here, 
the  alphabet  corresponds  to  the  set  of  structural  primitives.  A  number  of  for¬ 
malisms  have  been  used  to  express  structural  descriptions.  In  linguistic  terms, 
a pattern may be thought of as a string or sentence, and a grammar may be
associated  with  each  pattern  class.  The  grammar  controls  the  structure  of 
the  language  in  such  a  way  that  the  sentences  (patterns)  produced  belong 
exclusively  to  a  particular  pattern  class;  a  grammar  is  therefore  needed  for 
each  pattern  class.  Parsing  techniques  can  help  determine  whether  a  sentence 
(pattern)  is  grammatically  correct  for  a  given  language.  Both  deterministic 
and  stochastic  grammars  have  been  employed  in  pattern  classification.  (See 
Article XIII.E3 for a discussion of grammatical approaches to image understanding.)

Stochastic grammars (see Article XIV.D5e) have been used in an attempt
to  accommodate  the  possibilities  of  ambiguity  and  error  in  pattern  descrip¬ 
tion.  These  grammars  make  it  possible  for  probabilistic  assignments  to  be 
made.  Before  such  a  grammar  can  be  used  for  classification,  the  production 
probabilities must be determined, for example, by "learning" them from a set
of training examples.

There are still several difficulties associated with the structural approach
to pattern classification. In contrast to the statistical and parameter-learning
methods, very few practical structural training algorithms have presently been
proposed. The problem of learning a grammar from training instances is
called grammatical inference. Article XIV.D5e describes the current state of
work  in  that  area.  In  addition  to  the  problem  of  learning  the  grammar,  the 
steps  of  segmentation  into  primitives  and  formation  of  structural  descriptions 
are  only  partly  solved. 
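
As noted earlier in this section, the production probabilities of a stochastic grammar must be determined before the grammar can be used for classification. One simple way to "learn" them is relative-frequency counting, sketched below. The sketch assumes that every training pattern has already been parsed, so that the list of productions used in its derivation is available; the function name and the data layout are assumptions made for the illustration, and this estimator is not claimed to be the one used in the work cited above.

```python
from collections import Counter, defaultdict


def estimate_production_probabilities(derivations):
    """Estimate P(lhs -> rhs) by relative frequency.

    `derivations` is an iterable of derivations; each derivation is a list of
    (lhs, rhs) productions, with rhs given as a tuple of symbols.
    """
    counts = Counter(prod for deriv in derivations for prod in deriv)
    # Total number of times each left-hand side was expanded.
    totals = defaultdict(int)
    for (lhs, _rhs), n in counts.items():
        totals[lhs] += n
    # Normalize so the probabilities for each left-hand side sum to 1.
    return {prod: n / totals[prod[0]] for prod, n in counts.items()}


# Hypothetical derivations for the toy grammar S -> a S | b.
A_S = ("S", ("a", "S"))
B = ("S", ("b",))
EXAMPLE_DERIVATIONS = [[A_S, A_S, B], [A_S, B]]
EXAMPLE_PROBABILITIES = estimate_production_probabilities(EXAMPLE_DERIVATIONS)
```

With the two toy derivations shown, S -> a S is used three times and S -> b twice, so the estimates are 3/5 and 2/5.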

Relevance  for  Artificial  Intelligence 

This  survey  of  learning  systems  in  engineering  shows  that  many  of  the 
problems  addressed  are  analogous  to  those  encountered  in  the  design  of  AI 
learning  systems.  Engineering  systems  are  particularly  adept  at  handling 
noisy training instances, a problem that few AI systems have addressed. It
has  also  been  possible  to  develop  detailed  analyses  of  these  learning  algo¬ 
rithms,  including  convergence  proofs  and  investigations  of  their  statistical 
foundations. 

The  primary  drawback  of  these  methods  is  their  reliance  on  simple  feature- 
vector  representations.  Although  there  are  many  practical  applications  for 
which  these  representations  suffice,  most  problems  of  interest  to  AI  research¬ 
ers  require  more  expressive  representations.  The  more  recent  attempts  to  use 
automata  and  pattern-grammar  representations  are  much  more  relevant  to  AI 
research. 

Some aspects of the work in engineering may be important for AI researchers.
In addition to work on the problem of noise, some progress has been
made on solving the problem of choosing a good set of features with which to
perform the learning process. One approach is to estimate the discriminatory
ability of each feature given choices of the other features. Dynamic-programming
techniques can help determine a good ordering of the features (from
most relevant to least relevant). A second interesting approach, called
dimensionality reduction, is to take a large set of features and compute a new,
smaller set by forming linear combinations of the old features. The Karhunen-Loève
expansion can be used to create such derived features (see Fu, 1970a,
and Article XIII.CS).

D3. Learning Single Concepts


MANY PROGRAMS have been developed that are able to learn a single concept
from  training  instances.  This  article  describes  the  single-concept  learning 
problem  and  discusses  a  few,  selected  learning  programs  that  give  a  sense  of 
the  techniques  that  have  been  applied  to  this  problem. 

What  does  it  mean  to  learn  a  concept  from  training  instances?  The  term 
concept is used quite loosely in the AI literature. In this article, we take
a  concept  to  be  a  predicate,  expressed  in  some  description  language,  that 
is  TRUE  when  applied  to  a  positive  instance  and  FALSE  when  applied  to  a 
negative  instance  of  the  concept.  A  concept  is  thus  a  predicate  that  partitions 
the  instance  space  into  positive  and  negative  subsets.  For  example,  the  concept 
of  straight  can  be  thought  of  as  a  predicate  that  indicates,  for  any  poker  hand, 
whether  or  not  that  hand  is  a  straight. 
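
To make the idea of a concept as a predicate concrete, the straight concept can be written as a function that returns TRUE or FALSE for any poker hand. The sketch below is illustrative only. It assumes a hand is a list of five (rank, suit) pairs, that an ace may count as either high or low, and that suits are ignored (so a straight flush also satisfies the predicate); none of these conventions is specified in the text.

```python
# Numeric value of each rank; face-card names follow the hands used in this article.
RANK_VALUE = {n: n for n in range(2, 11)}
RANK_VALUE.update({"jack": 11, "queen": 12, "king": 13, "ace": 14})


def is_straight(hand):
    """Concept predicate: True if the five cards have consecutive ranks."""
    values = sorted(RANK_VALUE[rank] for rank, _suit in hand)
    if len(values) != 5 or len(set(values)) != 5:
        return False  # wrong hand size, or a repeated rank
    if values[-1] - values[0] == 4:
        return True  # five distinct ranks spanning exactly five values
    return values == [2, 3, 4, 5, 14]  # assumed convention: ace may be low
```

Applied to every possible hand, this predicate partitions the instance space into the positive instances (hands for which it returns True) and the negative instances (all others).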

The single-concept learning problem is the problem of discovering such a
concept predicate from training instances, that is, from a sample of positive
and negative instances in the instance space. The standard solution to this
problem is to provide the learning program with a space of possible concept
descriptions that the learning program searches to find the desired concept
description (see Article XIV.D1).

Formally, the single-concept learning problem can be stated as follows:

Given: (1) A representation language for concepts. This implicitly
           defines the rule space: the space of all concepts
           representable in the language.

       (2) A set of positive (and usually negative) training instances.
           In most work to date, these training instances are noise free
           and classified in advance by the teacher.

Find:  The unique concept in the rule space that best covers all of
       the positive and none of the negative instances. Most work
       to date assumes that if enough instances are presented,
       exactly one concept exists that is consistent with the training
       instances.

To  gain  insight  into  the  origin  of  the  single-concept  learning  problem,  it 
is  useful  to  examine  the  performance  tasks  that  make  use  of  the  concept  once 
it  is  learned.  The  standard  performance  task  is  classification;  the  system  is 
presented  with  new  unknowns  and  is  asked  to  classify  them  as  positive  or 
negative  instances  of  a  concept.  Another  common  task  is  prediction;  if  the 
training  instances  are  successive  elements  of  a  sequence,  the  system  is  asked  to 
predict  future  elements  in  the  sequence.  A  third  task  is  data  compression;  the 
system is given all possible instances (the full instance space) and is asked to
find a concept that compactly describes them. The concept-classification and
sequence-prediction tasks both arose as laboratory paradigms within cognitive
psychology (see Hunt, Marin, and Stone, 1966). Sequence extrapolation is also
a  paradigm  example  of  induction  as  discussed  by  philosophers  (Carnap,  1950). 
Data  compression  is  of  practical  value  for  storage  and  classification. 

The two key assumptions made in all of this work are (a) that the training
instances are all examples (or counterexamples) of a single concept and
(b) that that concept can be represented by a point in the given rule space.
When the first assumption is violated, it is necessary to find a set of concepts
that account for the training instances. The systems described in the article
on multiple concepts (Article XIV.D4) address this problem. When the second
assumption is violated, it is necessary to alter the rule space so that it does
contain the desired concept. Very little attention has been given to this problem
in single-concept learning. The BACON program employs some simple
methods to alter the rule space by adding new terms to the representation
language (see Article XIV.D3b).

Approaches  to  Solving  the  Single-concept  Learning  Problem 

In  Article  XIV.D1,  we  described  four  basic  techniques — version  spaces, 
refinement  operators,  generate  and  test,  and  schema  instantiation — that  are 
used  to  search  the  rule  space.  Each  of  these  search  methods  has  been  applied 
to  the  single-concept  learning  problem.  The  remainder  of  this  article  is  divided 
into four subarticles, one devoted to each method. The first two subarticles
describe data-driven methods. Mitchell's version-space method is discussed
first. It provides a useful framework for describing several related systems
developed by Hayes-Roth, Vere, and Winston. Then two refinement-operator
systems, BACON and CLS/ID3, are presented. The second pair of subarticles
describes model-driven methods: a generate-and-test method developed by
Dietterich and Michalski (1981) and a schema-instantiation method, SPARC,
that plays the card game Eleusis.

References

RECENT WORK by Mitchell (1977, 1979) provides a unified framework for
describing systems that use a data-driven, single-representation approach to
concept learning. Mitchell has noted that, in all representation languages, the
sentences can be placed in a partial order according to the generality of each
sentence. Figure D3a-1 illustrates this general-to-specific ordering with a few
sentences in predicate calculus containing the predicates RED and BLACK. The
concept ∃ c1 : RED(c1), for example, describes the set S of all poker hands
that contain at least one red card. This concept is more general than the
concept ∃ c1 c2 : RED(c1) ∧ RED(c2) that describes the set T of all poker hands
containing at least two red cards, since the set S strictly contains the set T.
The set of cards described by ∃ c1 c2 c3 : RED(c1) ∧ RED(c2) ∧ BLACK(c3)
is smaller still and, thus, is even more specific than the ∃ c1 c2 : RED(c1) ∧
RED(c2) concept.

It should be evident that the syntactic rules of generalization described in
Article XIV.D1 can be used to generate this partial ordering. In this example,
the dropping-conditions rule of generalization was applied to the three most
specific concepts to generate the others. In general, any rule space can be
partially ordered according to the general-to-specific ordering.

The  most  general  point  in  the  rule  space  is  usually  the  null  description 
(in which all conditions have been dropped), which places no constraints
on the training instances and thus describes anything. The most specific
points in the rule space correspond to the training instances themselves,
represented in the same representation language as that used for the rule space
(see Fig. D3a-2).


∃ c1 : RED(c1)

∃ c1 c2 : RED(c1) ∧ RED(c2)        ∃ c1 c2 : RED(c1) ∧ BLACK(c2)

∃ c1 c2 c3 : RED(c1) ∧ RED(c2) ∧ RED(c3)        ∃ c1 c2 c3 : RED(c1) ∧ BLACK(c2) ∧ BLACK(c3)

∃ c1 c2 c3 : RED(c1) ∧ RED(c2) ∧ BLACK(c3)

Figure D3a-1. A small rule space and its general-to-specific ordering.

[Figure D3a-2 labels: null description (more general); training instances (less general).]

Mitchell  has  pointed  out  that  programs  can  take  advantage  of  this  partial 
ordering to represent the set H of plausible hypotheses very compactly. A set
of points in a partially ordered set can be represented by its most general
and most specific elements. Thus, as shown in Figure D3a-3, the set H of
plausible hypotheses can be represented by two subsets: the set of most general
elements in H (called the G set) and the set of most specific elements in H
(called the S set). Once H has been represented in this manner, the rules of
generalization must be used to fill in the subspace between the G set and the
S set whenever the full H set is needed.

Figure D3a-3. Using the boundary sets to represent a subspace of the
rule space.

The Candidate-elimination Learning Algorithm

Mitchell's learning algorithm, called the candidate-elimination algorithm,
takes advantage of the boundary-set representation for the set H of plausible
hypotheses. Mitchell defines a plausible hypothesis as any hypothesis that has
not yet been ruled out by the data. The set H of all plausible hypotheses is
called the version space. Thus, the version space, H, is the set of all concept
descriptions  that  are  consistent  with  all  of  the  training  instances  seen  so  far. 

Initially,  the  version  space  is  the  complete  rule  space  of  possible  concepts. 
Then,  as  training  instances  are  presented  to  the  program,  candidate  concepts 
are eliminated from the version space. When it contains only one candidate
concept, the desired concept has been found. The candidate-elimination
algorithm is a least-commitment algorithm, since it does not modify the set
H until it is forced to do so by the training information. Positive instances
force the program to generalize; thus, very specific concept descriptions are
removed from the H set. Conversely, negative instances force the program
to specialize, so very general concept descriptions are removed from the H
set.  The  version  space  gradually  shrinks  in  this  manner  until  only  the  desired 
concept  description  remains. 

To  see  how  training  instances  force  the  version  space  to  shrink,  consider 
once  again  the  problem  of  teaching  a  program  the  flush  concept  in  poker. 
Suppose  the  program  has  already  seen  the  positive  training  instance 


{(2, clubs), (5, clubs), (7, clubs), (jack, clubs), (queen, clubs)} ⇒ FLUSH.

Since the candidate-elimination algorithm is a least-commitment algorithm, it
makes the most specific possible assumption about the flush concept. Namely,
it sets up the S set to contain

S = {SUIT(c1, clubs) ∧ RANK(c1, 2) ∧
     SUIT(c2, clubs) ∧ RANK(c2, 5) ∧
     SUIT(c3, clubs) ∧ RANK(c3, 7) ∧
     SUIT(c4, clubs) ∧ RANK(c4, jack) ∧
     SUIT(c5, clubs) ∧ RANK(c5, queen)}.

This  hypothesis  is  very  specific  indeed.  It  says  that  there  is  only  one  hand 
that  could  possibly  be  a  flush.  At  the  same  time,  however,  the  candidate- 
elimination  algorithm  makes  the  most  general  possible  assumption,  namely, 
that  every  possible  hand  is  a  flush.  The  G  set  contains  the  null  description. 
This means that the version space (the H set) of all plausible hypotheses
contains S, G, and every hypothesis in between.

Now, suppose the positive training instance

{(3, clubs), (8, clubs), (10, clubs), (king, clubs), (ace, clubs)} ⇒ FLUSH

is presented. The candidate-elimination algorithm realizes that its initial
assumption for the S set was too specific: there are other hands that can be
flushes. Thus, it is forced to generalize S to contain, among other hypotheses,
the rule

S = {SUIT(c1, clubs) ∧ SUIT(c2, clubs) ∧ SUIT(c3, clubs) ∧
     SUIT(c4, clubs) ∧ SUIT(c5, clubs)}.

The  G  set  does  not  change.  Suppose,  however,  that  a  negative  training 
instance 

{(3, spades), (8, clubs), (10, clubs), (king, clubs), (ace, clubs)} ⇒ ¬FLUSH

is  presented.  This  forces  the  candidate-elimination  algorithm  to  realize  that 
its  assumption  for  the  G  set,  that  any  hand  could  be  a  flush,  was  wrong.  It 
must  specialize  the  G  set  in  some  way,  so  that  it  does  not  wrongly  classify 
this  hand  as  a  flush. 

In  full  detail,  the  candidate-elimination  algorithm  proceeds  as  follows: 

Step 1. Initialize H to be the whole space. Thus, the G set contains only
the null description, and the S set contains all of the most specific
concepts in the space. (In practice, this is not actually done due to
the huge size of S. Instead, the S set is initialized to contain only
the first positive example. Conceptually, however, H starts out as
the whole space.)

Step 2. Accept a new training instance. If the instance is a positive example,
first remove from G all concepts that do not cover the new
example. Then update S to contain all of the maximally specific
common generalizations of the new instance and the previous elements
in S. In other words, generalize the elements in S as little as
possible, so that they will cover this new positive example. This is
called the Update-S routine.

If the instance is a negative example, first remove from S all concepts
that cover this counterexample. Then update the G set to
contain all of the maximally general, common specializations of
the new instance and the previous elements in G. In other words,
specialize the elements in G as little as possible so that they will
not cover this new negative example. This is called the Update-G
routine.

Step 3. Repeat step 2 until G = S and this is a singleton set. When this
occurs, H has collapsed to include only a single concept.

Step 4. Output H (i.e., either G or S).

Here is an example of a complete run of the candidate-elimination algorithm.
Suppose we have the following feature-vector representation language:
The instance space is a set of objects, each object having two features: size
and shape. The size of an object can be small or large, and the shape of an
object can be circle, square, or triangle. Figure D3a-4 shows the entire rule
space for this representation language.

Figure D3a-4. The initial version space and the general-to-specific
partial order.

Each  point  in  the  rule  space  specifies  either  a  variable  or  a  value  for  both 
of  the  features.  If  a  feature  is  specified  by  a  variable,  then  any  value  of  that 
feature  can  be  applied. 
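
The rule space for this two-feature language is small enough to enumerate directly. The sketch below is a minimal illustration under stated assumptions: a concept is encoded as a (size, shape) tuple, the marker `ANY` stands in for a variable such as x or y, and the function names `covers` and `more_general_or_equal` are chosen for this example only.

```python
from itertools import product

SIZES = ("small", "large")
SHAPES = ("circle", "square", "triangle")
ANY = "?"  # plays the role of a variable (x or y) in a concept description


def covers(concept, instance):
    """True if every feature of the concept is a variable or matches the instance."""
    return all(c == ANY or c == v for c, v in zip(concept, instance))


def more_general_or_equal(c1, c2):
    """True if c1 covers everything that c2 covers."""
    return all(a == ANY or a == b for a, b in zip(c1, c2))


# Every point in the rule space: each feature is either a value or a variable.
RULE_SPACE = list(product(SIZES + (ANY,), SHAPES + (ANY,)))
```

The enumeration yields twelve points: the null description (x y), five descriptions with exactly one variable, and the six fully specified objects. The test `more_general_or_equal` gives the general-to-specific partial order on these points.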

Suppose  we  want  to  teach  the  program  the  concept  of  a  circle.  This  is 
represented as (x circle) where x represents any size. First we initialize the
H set to be the entire rule space. This means that the G set is

G = {(x y)},

representing the most general possible concept, and the S set is

S = {(small square) (large square) (small circle) (large circle)
(small triangle) (large triangle)}.

Now we present the first training instance: a positive example of the
concept, a small circle. The Update-S algorithm is applied in step 2 to yield:

G = {(x y)}

S = {(small circle)}.

Figure  D3a-5  shows  the  resulting  version  space.  Solid  lines  connect  con¬ 
cepts  that  are  still  in  the  version  space.  In  practical  implementations  of  the 
candidate-elimination  algorithm,  the  version  space  is  usually  initialized  at  this 
point  rather  than  explicitly  listing  the  entire  instance  space  as  in  the  step 
above. 

The  second  training  instance  is  (large  triangle) — a  negative  example  of 
the concept. This forces the G set to be specialized. Update-G is applied to
produce

G = {(x circle) (small y)}

S = {(small circle)}.

Figure D3a-6 shows the resulting version space.

Figure D3a-5. The version space after the first training instance.

Notice how the (x y) description was specialized in two distinct ways, so
that  it  no  longer  covered  the  negative  example  (large  triangle).  A  third 
possible  specialization  (x  square)  is  not  considered,  since  it  was  removed 
from  the  version  space  during  the  previous  training  instance.  Of  course, 
further  specializations  such  as  (small  circle)  are  not  considered  because  the 
Update-G  algorithm  specializes  as  little  as  possible. 

In  this  case,  the  G  set  grew  larger  as  a  result  of  the  specialization.  The 
Update-G  and  Update-S  algorithms  often  expand  the  size  of  the  G  and  S 
sets.  It  is  the  size  of  these  sets  that  limits  the  practical  application  of  this 
algorithm. 

Finally, we present the algorithm with another positive example: (large
circle). Update-S first prunes G to eliminate (small y), since it does not
cover (large circle). Then S is generalized, as necessary:

G = {(x circle)}

S = {(x circle)}.

Since G = S, the algorithm halts and prints (x circle) as the concept.
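
The run traced above can be reproduced with a short program. What follows is a sketch under stated assumptions, not Mitchell's implementation: concepts are (size, shape) tuples with `ANY` standing for a variable, S is initialized from the first positive example (as is done in practice), and, because this language is purely conjunctive, each element of S has exactly one minimal generalization. All function names are chosen for this illustration.

```python
ANY = "?"  # stands for a variable such as x or y
DOMAINS = (("small", "large"), ("circle", "square", "triangle"))


def covers(h, x):
    return all(a == ANY or a == b for a, b in zip(h, x))


def more_general_or_equal(h1, h2):
    return all(a == ANY or a == b for a, b in zip(h1, h2))


def generalize(s, x):
    """Minimal generalization of s that covers x: turn mismatches into variables."""
    return tuple(a if a == b else ANY for a, b in zip(s, x))


def specializations(g, x):
    """Minimal specializations of g that no longer cover the instance x."""
    out = []
    for i, (a, b) in enumerate(zip(g, x)):
        if a == ANY:
            out.extend(g[:i] + (v,) + g[i + 1:] for v in DOMAINS[i] if v != b)
    return out


def candidate_elimination(examples):
    """examples: list of (instance, is_positive). Returns the boundary sets (G, S)."""
    G, S = [(ANY, ANY)], None  # S is filled in by the first positive example
    for x, positive in examples:
        if positive:  # Update-S
            G = [g for g in G if covers(g, x)]
            S = [x] if S is None else [generalize(s, x) for s in S]
            S = [s for s in S if any(more_general_or_equal(g, s) for g in G)]
        else:  # Update-G
            if S is not None:
                S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                for h in specializations(g, x):
                    # Keep h only if it still lies above some element of S.
                    if S is None or any(more_general_or_equal(h, s) for s in S):
                        new_G.append(h)
            new_G = list(dict.fromkeys(new_G))  # drop duplicates, keep order
            # Retain only the maximally general descriptions.
            G = [g for g in new_G
                 if not any(h != g and more_general_or_equal(h, g) for h in new_G)]
    return G, S


TRAINING = [(("small", "circle"), True),
            (("large", "triangle"), False),
            (("large", "circle"), True)]
```

Calling `candidate_elimination(TRAINING)` follows the three training instances of the example; stopping after the first two instances gives the intermediate boundary sets. The sketch does not attempt the pruning heuristics needed for large rule spaces.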

[Figure D3a-6. The version space after two training instances. Node labels, by row:
(x y)
(sm. y) (lg. y) (x square) (x circle) (x triangle)
(sm. square) (lg. square) (sm. circle) (lg. circle) (sm. triangle) (lg. triangle)]

It is possible to give intuitive interpretations of the G and S sets. The
set S is the set of sufficient conditions for a new example to be an instance
of the concept. Thus, after the second training instance, we know that if
the new example is a (small circle), it is an instance of the concept; (small
circle) is a sufficient condition for positive classification. The set G is the set
of  necessary  conditions.  After  the  second  training  instance,  we  know  that  an 
object  either  must  be  a  circle  or  must  be  small  in  order  to  be  an  instance  of  the 
concept.  Neither  of  these  conditions  is  sufficient.  The  algorithm  terminates 
when  the  necessary  conditions  are  equal  to  the  sufficient  conditions — that  is, 
the  algorithm  has  found  a  necessary  and  sufficient  condition. 
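
This reading of S as sufficient conditions and G as necessary conditions suggests how a partially learned concept can already be used for classification. The sketch below is an illustration, assuming the same tuple encoding as the worked example (with `ANY` for a variable); the three-way answer and the function name `classify` are choices made here, not part of the algorithm as stated.

```python
ANY = "?"  # stands for a variable such as x or y


def covers(h, x):
    return all(a == ANY or a == b for a, b in zip(h, x))


def classify(G, S, x):
    """Classify x using only the boundary sets of the version space."""
    if all(covers(s, x) for s in S):
        return "positive"  # a sufficient condition is met
    if not any(covers(g, x) for g in G):
        return "negative"  # no necessary condition is met
    return "unknown"  # the remaining hypotheses disagree about x


# Boundary sets after the second training instance of the worked example.
G_AFTER_TWO = [("?", "circle"), ("small", "?")]
S_AFTER_TWO = [("small", "circle")]
```

With these boundary sets, a small circle is classified as positive, a large square as negative, and a large circle as unknown, since it satisfies one necessary condition but not the sufficient one.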

It  is  important  to  note  that  the  candidate-elimination  algorithm  conducts 
an  exhaustive,  breadth-first  search  of  the  given  rule  space,  guided  only  by 
the training instances. This makes the algorithm infeasibly slow for large rule
spaces.  The  efficiency  of  the  algorithm  can  be  improved  (at  the  cost  of  possibly 
failing  to  find  the  desired  concept)  by  employing  heuristics  to  prune  the  S  and 
G  sets.  We  postpone  further  discussion  of  the  strengths  and  weaknesses  of 
the  candidate-elimination  algorithm  until  after  we  have  discussed  the  related 
methods developed by Hayes-Roth, Vere, and Winston.


Methods  Related  to  the  Version-space  Approach 

Two  learning  methods  similar  to  the  Update-S  procedure  of  the  version- 
space  algorithm  were  developed  prior  to  it.  One  method,  termed  interference 
matching, was developed by Hayes-Roth and McDermott (1977, 1978). The
other  method,  the  maximal  unifying  generalization  method,  was  developed  by 
Vere  (1975,  1978).  These  methods  can  both  be  viewed  as  implementations 
of  the  Update-S  procedure  with  respect  to  slightly  different  representation 
languages  in  that  they  learn  from  positive  training  instances  only. 

Interference  matching  was  developed  to  discover  concepts  expressed  in 
Hayes-Roth's Parameterized Structural Representation (PSR), which is roughly
equivalent to an existentially quantified conjunctive statement in predicate
calculus. Recall that Update-S seeks to generalize the descriptions in S
as little as possible in order to cover each new positive training instance.
When the descriptions are represented as predicate calculus expressions, this
is equivalent to finding the largest common subexpressions, because the largest
common subexpression is that subexpression for which the fewest conjunctive
conditions need to be dropped. As an example, suppose that the set S contains
the description

S = {BLOCK(x) ∧ BLOCK(y) ∧ RECTANGLE(x) ∧ ONTOP(x, y) ∧ SQUARE(y)}

and the next positive training instance (I1) is

I1 = BLOCK(w) ∧ BLOCK(v) ∧ SQUARE(w) ∧ ONTOP(w, v) ∧ RECTANGLE(v).

Update-S will produce the following common subexpressions:

S' = {s1, s2},

where s1 = BLOCK(a) ∧ BLOCK(b) ∧ SQUARE(a) ∧ RECTANGLE(b), and s2 =
BLOCK(c) ∧ BLOCK(d) ∧ ONTOP(c, d).

The s1 description corresponds to the hypothesis that the ONTOP relation
is irrelevant to the concept. The s2 description, on the other hand,
corresponds to the hypothesis that the shapes of the objects involved are
irrelevant. Notice that there is no consistent way to match I1 to S that
preserves a one-to-one correspondence of the variables x and y with w and v;
either the rectangle and square predicates conflict (e.g., when x is matched
with w) or else the order of the arguments to ONTOP conflict (e.g., when x is
matched to v).
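
The two common subexpressions in this example can be found by brute force: try every one-to-one binding of the variables of S to the variables of the new instance, and keep the conditions that survive each binding. The sketch below does exactly that. It is not the interference-matching algorithm itself, which grows and prunes partial matches rather than enumerating bindings. The encoding of conditions as tuples, the reading of the instance as using variables w and v, and the function names are assumptions made for the illustration.

```python
from itertools import permutations

# The description in S and the new instance, as tuples (predicate, arg, ...).
S_DESC = {("BLOCK", "x"), ("BLOCK", "y"), ("RECTANGLE", "x"),
          ("ONTOP", "x", "y"), ("SQUARE", "y")}
I1 = {("BLOCK", "w"), ("BLOCK", "v"), ("SQUARE", "w"),
      ("ONTOP", "w", "v"), ("RECTANGLE", "v")}


def variables(desc):
    return sorted({arg for cond in desc for arg in cond[1:]})


def common_subexpressions(d1, d2):
    """Maximal sets of conditions of d1 that also hold in d2 under some
    one-to-one binding of d1's variables to d2's variables."""
    vars1, vars2 = variables(d1), variables(d2)
    results = []
    for perm in permutations(vars2, len(vars1)):
        binding = dict(zip(vars1, perm))
        shared = frozenset(
            cond for cond in d1
            if (cond[0],) + tuple(binding[a] for a in cond[1:]) in d2)
        results.append((binding, shared))
    # Discard any result whose shared set is strictly contained in another.
    return [(b, s) for b, s in results if not any(s < t for _, t in results)]


MATCHES = common_subexpressions(S_DESC, I1)
```

Binding x to v and y to w keeps the two BLOCK conditions and both shape conditions (s1, with ONTOP dropped); binding x to w and y to v keeps the two BLOCK conditions and ONTOP (s2, with the shapes dropped). Enumerating all bindings grows factorially with the number of variables, which is why practical matchers prune the search.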

The  interference-matching  algorithm  starts  out  as  a  breadth-first  search 
of  all  possible  matchings  of  one  PSR  with  another.  The  search  proceeds  by 
“growing”  common  subexpressions  until  a  space  limit  is  reached.  Unpromising 
matches  are  then  pruned  with  a  heuristic  utility  function,  and  the  growing 
process  continues  in  a  more  depth-first  fashion.  The  utility  of  a  partial  match 
is  equal  to  the  number  of  predicates  matched  less  the  number  of  variables 
matched.  If  the  space  limit  is  approximately  the  same  as  the  largest  com¬ 
mon  subexpression,  the  algorithm  becomes  truly  depth-first,  since  only  one 
subexpression  “fits”  within  the  space  limit.  Thus,  the  interference-matching 
algorithm  tends  to  find  one  good  common  subexpression  rather  than  finding 
all maximal common subexpressions (as in the Update-S algorithm).

Vere's algorithm for finding the maximal unifying generalization of two
first-order predicate-calculus descriptions is very similar to the interference-matching
algorithm. The representation language used by Vere, however,
permits a many-to-one binding of parameters during the matching process
(Vere, 1975). Vere's method also conducts a breadth-first search of possible
matchings but does not do any pruning of this search.

Winston's Work on Learning Structural Descriptions from Examples

Winston’s (1970) influential work on structural learning served as a
precursor to the other learning methods described above. The method has the
same basic data-driven approach as the version-space and related algorithms:
Training instances are accepted one at a time and matched against
the concept descriptions in the set H. Unlike those breadth-first algorithms
(e.g., Update-S and Update-G), however, Winston's system conducts a
depth-first search of the concept space. Instead of maintaining a set of
plausible hypotheses, Winston’s program uses the training instances to update
a single current concept description. This description contains all of the
program’s knowledge about the concept being learned.

The task of the program is to learn concept descriptions that characterize
simple toy-block constructions. The toy-block assemblies are initially
presented to the computer as line drawings. A knowledge-based interpretation
program converts these line drawings into a semantic-network description.
Winston also uses this semantic-network representation to describe the
current concept and some background knowledge about toy blocks.

Figure  D3a-7  shows  a  line  drawing  of  an  arch  and  the  corresponding 
semantic  network.  The  network  is  roughly  equivalent  to  the  predicate-calculus 
expression 

ONE-PART-IS(arch, a) ∧ ONE-PART-IS(arch, b) ∧
ONE-PART-IS(arch, c) ∧ HAS-PROPERTY-OF(a, lying) ∧
A-KIND-OF(a, object) ∧ MUST-BE-SUPPORTED-BY(a, b) ∧
MUST-BE-SUPPORTED-BY(a, c) ∧ MUST-NOT-ABUT(b, c) ∧
MUST-NOT-ABUT(c, b) ∧ LEFT-OF(b, c) ∧ RIGHT-OF(c, b) ∧
HAS-PROPERTY-OF(b, standing) ∧ HAS-PROPERTY-OF(c, standing) ∧
A-KIND-OF(b, brick) ∧ A-KIND-OF(c, brick),

along  with  statements  of  blocks-world  knowledge  such  as 

A-KIND-OF(brick, object)

A-KIND-OF(standing, property)

and statements relating different predicates in the representation language,
such as

OPPOSITES(MUST-ABUT, MUST-NOT-ABUT)

MUST-FORM-OF(IS-SUPPORTED-BY, MUST-BE-SUPPORTED-BY).

A  distinctive  aspect  of  Winston's  concept  representation  is  that  it  allows 
necessary  conditions  to  be  represented  explicitly.  For  example,  the  condition 
that  in  an  arch  the  posts  must  not  touch  can  be  directly  represented  by  a 
MUST-NOT-ABUT  link.  This  allows  Winston's  program  to  express  necessary 
and  sufficient  conditions  in  one  combined  network  structure. 

Winston's  learning  algorithm  works  as  follows: 

Step 1. Initialize the current concept description, H, to be the network
corresponding to the first positive training instance.

Step 2. Accept a new line drawing and convert it into a semantic-network
representation.

Step 3. Match the training instance with H (using a graph-matching
algorithm) to obtain the common skeleton. The skeleton is a maximal
common subgraph of the two graphs. Annotate the skeleton by
attaching comments indicating those nodes and links that did not
match.

Step 4. Use the annotated skeleton to decide how to modify the current
concept description H.

If the new instance is a positive example of the concept, then
generalize H as necessary. The algorithm generalizes either by
dropping nodes and links or by replacing one node (e.g., cube) by a
more general node (e.g., brick). In some cases, the algorithm must
choose between these two generalization techniques. The program
chooses the less drastic method (node replacement) and places the
other choice on a backtrack list.

If the new instance is a negative example of the concept, a necessary
condition (represented by a must-link) is added to H. If there are
several differences between the negative training instance and H,
the algorithm applies some ad hoc rules to choose one difference
to "blame" for causing the instance to be a negative instance.
This difference is converted into a necessary condition. The other
differences are ignored.

Repeat steps 2, 3, and 4 until the teacher halts the program.

Figure D3a-7. A training instance and its internal representation.
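
The update in step 4 can be sketched in a few lines. This is a minimal illustration under strong assumptions, not Winston's program: the graph matching of step 3 is skipped by assuming that node names already correspond, links are (relation, node, value) triples, the A-KIND-OF hierarchy in `KIND_OF` is hypothetical, the rule for choosing which difference to blame is invented, and backtracking is omitted.

```python
# Hypothetical A-KIND-OF hierarchy (child -> parent), for illustration only.
KIND_OF = {"cube": "brick", "brick": "object", "wedge": "object"}


def ancestors(node):
    chain = [node]
    while node in KIND_OF:
        node = KIND_OF[node]
        chain.append(node)
    return chain


def common_ancestor(a, b):
    others = ancestors(b)
    return next((x for x in ancestors(a) if x in others), None)


def generalise(h, positive):
    """Positive instance: replace a node by a more general one if possible,
    otherwise drop the unmatched link."""
    kinds = {node: val for rel, node, val in positive if rel == "A-KIND-OF"}
    links = set()
    for rel, node, val in h["links"]:
        if (rel, node, val) in positive:
            links.add((rel, node, val))
        elif rel == "A-KIND-OF" and node in kinds:
            parent = common_ancestor(val, kinds[node])
            if parent is not None:
                links.add((rel, node, parent))
    h["links"] = links


def specialise(h, near_miss):
    """Negative instance: blame one difference, make it a necessary condition."""
    extra = sorted(near_miss - h["links"])
    missing = sorted(h["links"] - near_miss - h["must"])
    if extra:                       # assumed rule: blame an extra link first
        h["must_not"].add(extra[0])
    elif missing:                   # other differences are ignored
        h["must"].add(missing[0])


def learn(examples):
    first, _ = examples[0]          # first example is assumed to be positive
    h = {"links": set(first), "must": set(), "must_not": set()}
    for instance, is_positive in examples[1:]:
        (generalise if is_positive else specialise)(h, instance)
    return h


arch = {("SUPPORTED-BY", "a", "b"), ("SUPPORTED-BY", "a", "c"),
        ("A-KIND-OF", "a", "brick")}
touching = arch | {("ABUT", "b", "c")}            # near miss: posts touch
unsupported = {("A-KIND-OF", "a", "brick")}       # near miss: no support
wedge_arch = {("SUPPORTED-BY", "a", "b"), ("SUPPORTED-BY", "a", "c"),
              ("A-KIND-OF", "a", "wedge")}

h = learn([(arch, True), (touching, False), (unsupported, False),
           (wedge_arch, True)])
print(h)
```

In the `unsupported` near miss two links are missing, but only one is promoted to a necessary condition, which mirrors the single-blame behaviour described in step 4.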

Since  the  algorithm  searches  in  depth-first  fashion,  it  is  possible  for  con¬ 
tradictions  to  arise  in  step  4.  For  example,  after  seeing  a  negative  training 
instance  such  as  shown  in  Figure  D3a-8,  the  algorithm  might  assume  in  step  4 
that the reason this is not an arch is the triangular lintel rather than the fact
that  the  posts  are  touching.  Subsequently,  when  the  program  sees  the  positive 
instance  shown  in  Figure  D3a-9,  a  contradiction  arises.  When  this  happens, 
the  system  backtracks  to  the  last  point  at  which  a  choice  was  made,  and  the 
algorithm  makes  a  new  choice. 

This learning algorithm is somewhat weak and ad hoc, since it does not
concern itself either with the possibility that the training instance matches
H in multiple ways or with the problem that there are multiple ways of
generalizing or specializing H. Winston makes two important assumptions
that allow this algorithm to ignore these problems. First, it is assumed
that the training instances are presented in good pedagogical order, so that
contradictions and choice-points are unlikely to arise; the teacher is assumed
to have chosen the examples so as to vary only one aspect of the concept in
each example. The second assumption is that the negative training instances
are all near misses, that is, instances that just barely fail to be examples of
the concept in question. These two assumptions permit the learning system
to perform fairly well in the domain of toy-block concepts.

Figure D3a-8. A near-miss negative example of an ARCH.

Figure D3a-9. A positive example of an ARCH.

Weaknesses  of  the  Version-space  Approach  (and  Related  Approaches) 

There are several weaknesses in these methods that limit their practical
application. This section discusses these problems and examines some
proposed solutions.

Noisy training instances. As with all data-driven algorithms, these
methods have difficulty with noisy training instances. Since these algorithms
seek to find a concept description that is consistent with all of the training
instances, any single bad instance (i.e., a false positive or false negative
instance) can have a big effect. When the candidate-elimination algorithm is
given a false positive instance, for example, the S set becomes overly
generalized. Similarly, a false negative instance causes the G set to become
overly specialized. Eventually, noisy training instances can lead to a
situation in which there are no concept descriptions that are consistent with
all of the training instances. In such cases, the G set "passes" the S set,
and the version space of consistent concept descriptions becomes empty. The
methods of Hayes-Roth, Vere, and Winston also overgeneralize in the presence
of false positive training instances.

In order to learn in the presence of noise, it is necessary to relax the
condition that the concept descriptions be consistent with all of the training
instances. One solution, proposed by Mitchell (1978), is to maintain several S
and G sets of varying consistency. The set S0, for example, is consistent with
all of the positive examples, and the set S1 is consistent with all but one of
the positive examples. In general, each description in the set Si is consistent
with all but i of the positive training instances. Similarly, each description
in the set Gi is consistent with all but i of the negative training instances.
Figure D3a-10 gives a schematic diagram of these sets. Mitchell provides a
fairly efficient algorithm for updating these multiple boundary sets.

Figure D3a-10. The multiple-boundary-set technique.

When G0 crosses S0, the algorithm can conclude that no concept in the
rule space is consistent with all of the training instances. The algorithm can
recover and try to find a concept that is consistent with all but one of the
training instances. If that fails, it can look for a concept consistent with
all but two instances, and so forth. This approach to error recovery works
for learning problems containing a few erroneous training instances, but it
requires a large amount of memory to store all of the S and G boundary sets.
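
The recovery strategy (look for a concept consistent with all instances, then all but one, then all but two) can be illustrated by brute force on a tiny rule space. The sketch below is not Mitchell's boundary-set algorithm: it enumerates every conjunctive description over two hypothetical features, with "?" as a wildcard, and counts misclassified positive and negative instances together. All names and data are assumptions made for the illustration.

```python
from itertools import product


def covers(description, instance):
    """A description covers an instance if every non-wildcard value agrees."""
    return all(d == "?" or d == v for d, v in zip(description, instance))


def errors(description, positives, negatives):
    missed = sum(not covers(description, p) for p in positives)
    admitted = sum(covers(description, n) for n in negatives)
    return missed + admitted


def relaxed_search(feature_values, positives, negatives):
    """Return the smallest k, and the descriptions that are consistent
    with all but k of the training instances."""
    space = list(product(*[values + ["?"] for values in feature_values]))
    for k in range(len(positives) + len(negatives) + 1):
        found = [d for d in space if errors(d, positives, negatives) <= k]
        if found:
            return k, found


feature_values = [["small", "large"], ["red", "blue"]]
positives = [("small", "red"), ("large", "red")]
negatives = [("small", "blue"), ("large", "red")]   # second one is noise
k, found = relaxed_search(feature_values, positives, negatives)
print(k, found)
```

Because one instance appears as both positive and negative, no description is fully consistent, and the search succeeds only after allowing one error.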

Disjunctive concepts. A second, important weakness of these data-driven
algorithms is their inability to discover disjunctive concepts. Many
concepts have a disjunctive form. For instance, an uncle is either the brother
of a parent or the spouse of a sister of a parent:

UNCLE(x) = BROTHER(PARENT(x)) ∨
UNCLE(x) = SPOUSE(SISTER(PARENT(x))).

Parent itself might be expressed disjunctively as PARENT(x) = FATHER(x) ∨
PARENT(x) = MOTHER(x). However, if disjunctions of arbitrary length are
permitted in the representation language, the data-driven algorithms described
above never generalize. In the candidate-elimination algorithm, for example,
the S set will always contain a single disjunction of all of the positive
training instances seen so far. This is because the least generalization of a
new training instance and the current S set is simply the disjunction of the
new instance with the S set. Similarly, the G set will contain the disjunction
of the negation of each of the negative training instances. Unlimited
disjunction allows the partially ordered rule space to become infinitely
"branchy."

The basic difficulty is that all of those algorithms are least-commitment
algorithms that generalize only when they are forced to. Disjunction provides
a way of avoiding any generalization at all, so the algorithms are never forced
to generalize. In order to develop a useful technique for learning disjunctive
concepts, some method must be found for controlling the introduction of
disjunctions. The learning algorithms must be guided toward generalizing in
certain ways to exclude the trivial disjunction.

One solution (proposed in different forms by Michalski, 1969, and by
Mitchell,  1978)  is  to  employ  a  representation  language  that  does  not  contain 
a disjunction operator and to perform repeated candidate-elimination runs
to find several conjunctive descriptions that together cover all of the
training instances. We repeatedly find a conjunctive concept description that is
consistent  with  some  of  the  positive  training  instances  and  all  of  the  nega¬ 
tive  training  instances.  The  positive  instances  that  have  been  accounted  for 
are  removed  from  further  consideration,  and  the  process  is  repeated  until  all 
positive  instances  have  been  covered: 

Step 1. Initialize the S set to contain one positive training instance. G is
initialized to the null description, the most general concept.

Step 2. For each negative training instance, apply the Update-G algorithm
to G.

Step 3. Choose a description g from G as one conjunction for the solution
set. Since Update-G has been applied using all of the negative
instances, g covers no negative instances. However, g may cover
several of the positive instances. Remove from further consideration
all positive training instances that are more specific than g.

Repeat steps 1 through 3 until all positive training instances are
covered.

This process builds a disjunction of descriptions that covers all of the data.
It tends to find a disjunction containing only a few conjunctive terms.
Figure D3a-11 is a schematic diagram of how this process works.

The point s1 is the first positive training instance selected in step 1. After
all of the negative instances have been processed with Update-G, g1 is selected
from the G set in step 3. Notice that g1 covers several positive instances in
addition to s1, but that not all positive instances are yet covered. The point
s2 is then chosen and g2 is developed. Similarly, s3 is chosen and g3 is
developed. As the figure shows, the conjunctive concepts, gi, need not be
disjoint. Also, the set of concepts gi that is obtained by this procedure
varies depending on the order in which the positive training instances are
selected in step 1.
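
The covering loop described above can be sketched as follows. This is a simplified stand-in, not the candidate-elimination procedure itself: instead of maintaining a full G set with Update-G, the hypothetical function `specialise_against` greedily binds wildcard features to the seed's values until no negative instance is covered, which yields one description g per seed. The feature-vector encoding and the training data are assumptions, and the sketch assumes noise-free data (no negative instance identical to a seed).

```python
def covers(description, instance):
    return all(d == "?" or d == v for d, v in zip(description, instance))


def specialise_against(seed, negatives):
    """Start from the most general description and specialise it toward the
    seed until it covers no negative instance (greedy stand-in for Update-G)."""
    g = ["?"] * len(seed)
    remaining = list(negatives)
    while remaining:
        free = [i for i in range(len(seed)) if g[i] == "?"]
        # Bind the feature that excludes the most remaining negatives.
        # max() raises ValueError if a negative equals the seed (noisy data).
        best = max(free, key=lambda i: sum(n[i] != seed[i] for n in remaining))
        g[best] = seed[best]
        remaining = [n for n in remaining if n[best] == seed[best]]
    return tuple(g)


def cover(positives, negatives):
    """Build a disjunction (list) of conjunctive descriptions."""
    uncovered = list(positives)
    terms = []
    while uncovered:
        seed = uncovered[0]                       # step 1: pick a seed
        g = specialise_against(seed, negatives)   # steps 2 and 3
        terms.append(g)
        uncovered = [p for p in uncovered if not covers(g, p)]
    return terms


# Features: (size, shape, color). Target: large circle or small square.
positives = [("large", "circle", "red"), ("large", "circle", "blue"),
             ("small", "square", "red")]
negatives = [("small", "circle", "red"), ("large", "square", "blue"),
             ("large", "triangle", "red"), ("small", "triangle", "blue")]
terms = cover(positives, negatives)
print(terms)
```

As in the schematic description, the result depends on which positive instance is taken as the seed at each round.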

An algorithm very similar to this, called the A^q algorithm, was developed
by Michalski (1969, 1975) for use with an extended propositional calculus
representation. The A^q algorithm makes use of an additional heuristic in
step 1. It selects as a "seed" positive training instance one that has not
been covered by any description in any previous G set. This has the effect
of choosing training instances that are "far apart" in the instance space.
Larson (1977) elaborated A^q to apply it to an extended predicate-calculus
representation.

Figure D3a-11. Schematic diagram of an iterative version-space algorithm
for finding disjunctive concepts.

The  effect  of  this  iterative  version-space  approach  is  to  find  a  description 
with  virtually  the  fewest  number  of  disjunctive  terms.  Finding  such  a  descrip¬ 
tion  is  not  always  desirable.  Programs  searching  for  symmetrical  descriptions, 
for  example,  may  hypothesize  a  disjunctive  term  for  which  there  is,  as  yet,  no 
evidence.  Consider  how  a  program  would  learn  the  direction  of  wind  rotation 
about  a  weather  system.  After  seeing  the  following  two  training  instances 

Instance 1. HEMISPHERE = north ∧ PRESSURE = high
⇒ ROTATION = clockwise

Instance 2. HEMISPHERE = south ∧ PRESSURE = high
⇒ ROTATION = counterclockwise,

the program might hypothesize that

HEMISPHERE = north ∧ PRESSURE = high ∨
HEMISPHERE = south ∧ PRESSURE = low
⇒ ROTATION = clockwise,

even though the simplest hypothesis would be

HEMISPHERE = north ⇒ ROTATION = clockwise.

The problem of learning disjunctive concepts is still largely unexamined
by AI researchers.

Mitchell (1977, 1979) provides good descriptions of the version-space
approach. Hayes-Roth and McDermott (1978), Vere (1975), and Winston (1970)
present detailed descriptions of their methods. See Dietterich and Michalski
(1981) for a critical comparison of these methods.

THE SECOND FAMILY of data-driven methods does not employ partial matching
to search the rule space. Instead, these methods develop a set of hypotheses
in a rule space that is separate from the instance space (i.e., the
single-representation trick is not used). The hypotheses are modified by
refinement operators, which are selected by heuristics that inspect the
training instances. The following is a general outline of these
operator-based algorithms:

Step  1.  Gather  some  training  instances. 

Step  2.  Analyse  the  instances  to  decide  which  rule-space  operator  to  apply. 

Step  3.  Apply  the  operator  to  make  some  change  in  the  current  set,  H,  of 
hypotheses. 

Repeat  steps  1  through  3  until  satisfactory  hypotheses  are  obtained. 

In  this  article,  two  systems  are  described  that  use  this  technique:  BACON  and 
CLS. 

BACON 

BACON  is  a  set  of  concept-learning  programs  developed  by  Pat  Langley 
(1977,  1980).  These  programs  solve  a  variety  of  single-concept  learning  tasks, 
including  “rediscovering”  such  classical  scientific  laws  as  Ohm’s  law,  Newton’s 
law  of  universal  gravitation,  and  Kepler’s  law.  The  programs  are  also  capable 
of  using  the  learned  concepts  to  predict  future  training  instances. 

The  idea  underlying  BACON  is  simple:  The  program  repeatedly  exam¬ 
ines  the  data  and  applies  its  refinement  operators  to  create  new  terms.  This 
continues  until  it  finds  that  one  of  these  terms  is  always  constant.  A  single 
concept  is  thus  represented  in  the  form  term  =  constant  value. 

BACON  uses  a  feature-vector  representation  to  describe  each  training 
instance.  A  distinguishing  aspect  is  that  the  features  may  take  on  continuous 
real  values  as  well  as  discrete  symbolic  or  numeric  values.  For  example, 
suppose  we  want  BACON  to  discover  Kepler's  law:  The  period  of  a  planet’s 
revolution  around  the  sun,  p,  is  related  to  its  distance  from  the  sun,  d,  as 
d³/p² = k, for some constant k. First, BACON is supplied with training
instances of the form:

                      Features
Instance   Planet     p     d
I1         Mercury    1     1
I2         Venus      8     4
I3         Earth      27    9

BACON is told that p and d are dependent on the value of the planet
variable. Once BACON has gathered a few training instances, it examines
them to see if any of its rule-space operators are triggered. In this case,
since p and d are both increasing and are not linearly related, an operator
that creates the new term d/p is triggered. This rule-space operator is
executed, and the training instances are reformulated to give:


                      Features
Instance   Planet     p     d     d/p
I1         Mercury    1     1     1.0
I2         Venus      8     4     .5
I3         Earth      27    9     .33

Again, BACON checks to see if any of its rule-space operators are triggered.
This time, the product operator is executed to create the term (d/p)d,
since d and d/p are varying inversely. The data are reformulated to give:

                      Features
Instance   Planet     p     d     d/p    d²/p
I1         Mercury    1     1     1.0    1.0
I2         Venus      8     4     .5     2.0
I3         Earth      27    9     .33    3.0

On the third iteration, BACON again checks to see if any operators apply.
The product operator is again triggered to create the term (d/p)(d²/p). The
data are reformulated to give:

                      Features
Instance   Planet     p     d     d/p    d²/p   d³/p²
I1         Mercury    1     1     1.0    1.0    1.0
I2         Venus      8     4     .5     2.0    1.0
I3         Earth      27    9     .33    3.0    1.0

BACON  examines  these  data,  and  its  constancy  operator  is  triggered  to 
create the hypothesis that the d³/p² term is constant. BACON then gathers
more  data  to  test  this  hypothesis  before  it  halts. 
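
The three iterations above can be replayed with exact rational arithmetic. The sketch below is a reading of the example, not Langley's implementation: `trend` and `apply_operator` are hypothetical helpers, the order in which pairs of terms are compared (`plan`) is copied from the text rather than discovered, and the test for non-linearity that accompanies the quotient operator is omitted.

```python
from fractions import Fraction


def trend(xs, ys):
    """'together' if ys rises as xs rises, 'inverse' if ys falls, else None."""
    pairs = sorted(zip(xs, ys))
    steps = [b[1] - a[1] for a, b in zip(pairs, pairs[1:])]
    if all(s > 0 for s in steps):
        return "together"
    if all(s < 0 for s in steps):
        return "inverse"
    return None


def apply_operator(xs, ys):
    """Quotient for terms that vary together, product for inverse variation."""
    kind = trend(xs, ys)
    if kind == "together":
        return "quotient", [x / y for x, y in zip(xs, ys)]
    if kind == "inverse":
        return "product", [x * y for x, y in zip(xs, ys)]
    raise ValueError("no operator triggered")


# Training instances from the text (Mercury, Venus, Earth).
terms = {"p": [Fraction(1), Fraction(8), Fraction(27)],
         "d": [Fraction(1), Fraction(4), Fraction(9)]}

# Pairs of terms compared on each iteration, as described in the text.
plan = [("d", "p", "d/p"), ("d/p", "d", "d^2/p"), ("d/p", "d^2/p", "d^3/p^2")]
trace = []
for a, b, name in plan:
    op, terms[name] = apply_operator(terms[a], terms[b])
    trace.append(op)

# Constancy detection: a term whose value is the same in every instance.
constant_terms = [n for n, values in terms.items() if len(set(values)) == 1]
print(trace, constant_terms)
```

Using `Fraction` avoids the rounding seen in the tables (.33 for 1/3), so the final term is exactly constant.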

BACON’s  Rule-space  Operators 

The various BACON programs have different rule-space operators. Each
operator is stored as a production rule, of which the left-hand side performs
extensive tests to search for possible patterns in the data and the right-hand
side creates the new terms. Here is a brief survey of the operators implemented
in the BACON.1 program:

1. Constancy detection. This operator is triggered when some dependent
variable takes on the same value, v, at least two times. It creates the
hypothesis that this variable is always constant with value v.

2.  Specialization.  This  operator  is  triggered  when  a  previously  created 
hypothesis  is  contradicted  by  the  data.  It  specialises  the  hypothesis  by 
adding  a  conjunctive  condition. 

3.  Slope  and  intercept  term  creation.  This  operator  detects  that  two  variables 
are  varying  together  linearly  and  creates  new  terms  for  the  slope  and 
intercept  of  this  linear  relation. 

4. Product creation. This operator detects that two variables are varying
inversely without a constant slope. It creates a new term that is the
product of the two variables.

5.  Quotient  creation.  This  operator  detects  that  two  variables  are  vary¬ 
ing  monotonically  (increasing  or  decreasing)  without  constant  slope.  It 
creates  a  new  term  that  is  the  quotient  of  the  two  variables. 

6. Modulo-n term creation. This operator notices that one variable, v1, takes
on a constant value whenever an independent variable, v2, has a certain
value modulo n. The new term v2-modulo-n is created. Only small values
of n are considered.

Extensions to BACON

BACON.2 is an extended version of BACON.1 that includes two additional
operators for detecting recurring sequences and for creating polynomial terms
by calculating repeated differences. BACON.2 can solve a larger class of
sequence extrapolation tasks as a result.

BACON.3 is another extension of BACON.1 that uses hypotheses proposed
by the constancy-detection operators to reformulate the training instances.
For BACON.3 to discover the ideal gas law (PV/NT is equal to a constant),
for example, it is given the following training instances:


                        Features
Instance   V          P         T     N
I1         .0083200   300,000   300   1
I2         .0062400   400,000   300   1
I3         .0049920   500,000   300   1
I4         .0085973   300,000   310   1
I5         .0064480   400,000   310   1
I6         .0051584   500,000   310   1
I7         .0088747   300,000   320   1
I8         .0066560   400,000   320   1
I9         .0053248   500,000   320   1
...
I25        .0266240   300,000   320   3
I26        .0199680   400,000   320   3
I27        .0159744   500,000   320   3

By  applying  the  product-creation  operator  followed  by  the  constancy- 
detection  operator,  BACON  develops  the  hypothesis  that  PV  is  constant  for 
particular  values  of  N  and  T.  This  hypothesis,  which  BACON  must  rediscover 
for  each  particular  value  of  N  and  T,  is  used  to  recast  the  data  to  give  the 
following  derived  training  instances: 


                 Features
Instance   PV           T     N
I'1        2,496        300   1
I'2        2,579.1999   310   1
I'3        2,662.3999   320   1
I'4        4,991.9999   300   2
I'5        5,158.3999   310   2
I'6        5,324.7999   320   2
I'7        7,488        300   3
I'8        7,737.5999   310   3
I'9        7,987.2      320   3

Each of these derived instances results from collapsing three of the original
training instances. Thus, I'1 is derived by noticing that PV takes on the
constant value 2,496 in I1, I2, and I3. By applying the slope-intercept operator
to these derived instances, BACON develops the hypothesis that PV/T is
constant for particular values of N. It uses this hypothesis to recast the
training instances into the following form:


             Features
Instance   PV/T    N
I''1       8.32    1
I''2       16.64   2
I''3       24.95   3

By  applying  the  slope-intercept  operator  to  these  doubly  derived  instances, 
BACON develops the hypothesis that PV/NT is constant and, thus, posits the
ideal  gas  law. 
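
The successive recasting of the data can be checked numerically. The sketch below uses eleven of the original training instances listed above and is only an approximation of the BACON.3 procedure: the grouping by (N, T), the tolerance `rel_tol`, and the helper `near_constant` are assumptions, and the two slope-intercept steps are replaced by plain division, which presumes that both intercepts are zero.

```python
import math
from collections import defaultdict

# (V, P, T, N) for instances I1-I9, I25 and I26 of the table above.
rows = [
    (0.0083200, 300000, 300, 1), (0.0062400, 400000, 300, 1),
    (0.0049920, 500000, 300, 1), (0.0085973, 300000, 310, 1),
    (0.0064480, 400000, 310, 1), (0.0051584, 500000, 310, 1),
    (0.0088747, 300000, 320, 1), (0.0066560, 400000, 320, 1),
    (0.0053248, 500000, 320, 1), (0.0266240, 300000, 320, 3),
    (0.0199680, 400000, 320, 3),
]


def near_constant(values, rel_tol=1e-4):
    """Constancy detection based on near equality (tolerance is assumed)."""
    return all(math.isclose(x, values[0], rel_tol=rel_tol) for x in values)


# First level: the product PV is (nearly) constant for fixed N and T.
groups = defaultdict(list)
for v, p, t, n in rows:
    groups[(n, t)].append(p * v)
assert all(near_constant(pv) for pv in groups.values())

# Collapse each group into one derived instance carrying the mean PV.
derived = {key: sum(pv) / len(pv) for key, pv in groups.items()}

# Second and third levels: PV/T for fixed N, then PV/(N*T) overall.
pv_over_t = {(n, t): pv / t for (n, t), pv in derived.items()}
pv_over_nt = [ratio / n for (n, t), ratio in pv_over_t.items()]
print(sorted(derived.items()))
print(pv_over_nt, near_constant(pv_over_nt))
```

Every value of PV/NT comes out close to 8.32, in agreement with the final table.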
BACON’s Rule Space

What  is  the  rule  space  that  BACON  is  searching?  BACON  expresses 
hypotheses  as  feature  vectors,  some  of  whose  values  are  omitted  (i.e.,  turned 
to  variables).  For  example,  Kepler's  law  is  expressed  as 

Features:  Planet   p   d   d/p   d²/p   d³/p²

Values:    -        -   -   -     -      1.0

Thus,  the  rule  space  is  the  space  of  such  feature  vectors  whose  features  are 
any  terms  that  BACON  can  create  with  its  operators. 

BACON  conducts  a  sort  of  depth-first  search  through  this  space.  The 
conditions  under  which  the  operators  are  triggered  are  quite  specialised.  The 
constancy-detection  operator,  for  example,  only  checks  the  values  of  the 
most  recently  created  dependent  variable  against  the  most  recently  varied 
independent  variable.  Most  of  the  other  operators  are  invoked  under  similarly 
constrained  conditions. 


Strengths  and  Weaknesses  of  BACON 

BACON's  primary  strength  is  its  ability  to  discover  simple  laws  relating 
real-valued variables. Also of interest is BACON’s use of rule-space operators
to  create  new  terms  as  combinations  of  existing  terms.  Further,  the  BACON.3 
strategy  of  reformulating  the  training  instances  when  partial  regularities  are 
discovered  may  be  important  for  future  learning  programs.  Simon  (1979)  has 
discussed  BACON  as  a  model  of  data-driven  theory  formation  in  science. 

There  are  some  difficulties  with  the  present  BACON  programs,  however. 
First,  the  fact  that  the  operators  are  evoked  only  under  highly  specialized 
conditions  causes  the  program  to  be  sensitive  to  the  order  of  the  variables  and 
to  the  particular  values  chosen  for  the  training  instances.  For  some  sets  of 
training  instances,  for  example,  BACON  is  unable  to  discover  Ohm’s  law  (see 
Langley,  1980,  p.  104).  It  is  necessary  to  adjust  the  order  of  the  variables  and 
the  particular  training  instances  to  get  BACON  to  discover  concepts  efficiently. 
For  example,  when  BACON  is  discovering  the  pendulum  law,  40%  more  time 
is  required  if  the  variables  are  poorly  ordered.  Similarly,  it  cannot  handle 
irrelevant  variables  well. 

Second,  BACON  is  unable  to  handle  noisy  training  instances.  The  trig¬ 
gering  of  the  constancy  detectors,  for  example,  is  based  on  the  near  equality 
of  the  values  seen  in  as  few  as  two  training  instances.  Such  calculations  are 
highly  sensitive  to  noise.  The  slope  detectors  are  similarly  sensitive. 

Third,  BACON  can  handle  only  relatively  simple  concept-formation  tasks 
involving  nonnumeric  variables.  The  program  cannot,  for  example,  discover 
concepts  that  involve  internal  disjunction  (such  as  the  concept  of  a  red  or 
green cube). It is also unable to discover the simple concept underlying the
letter sequence ABTCDSEFR ... and similar sequences appearing in Kotovsky
and Simon (1973).

In  summary,  BACON  is  interesting  primarily  for  its  use  of  rule-space 
operators  to  create  product,  quotient,  slope,  and  intercept  terms  and  for  its 
ability  to  recast  the  training  instances  on  the  basis  of  developed  hypotheses. 


CLS/ID3

CLS (Concept Learning System) is a learning algorithm devised by Earl
Hunt (see Hunt, Marin, and Stone, 1966). It is intended to solve
single-concept learning tasks and uses the learned concepts to classify new
instances. A more recent version of the CLS algorithm, ID3, was developed by
Ross Quinlan (1979, in press). In this article, we discuss the ID3 algorithm
and its application to data compression and concept formation.

Like  BACON,  ID3  uses  a  feature-vector  representation  to  describe  the 
training instances. The features must each have only a small number of
possible discrete values. Concepts are represented as decision trees. For example,
if  the  features  of  size  (small,  large),  shape  (circle,  square,  and  triangle),  and 
color  (red,  blue)  are  used  to  represent  the  training  instances,  the  concept  of  a 
red  circle  (of  any  size)  could  be  represented  as  the  tree  shown  in  Figure  D3b-1. 

An instance is classified by starting at the root of the tree and making
tests and following branches until a node is arrived at that indicates the class
as YES or NO (see Article XI.D). For example, the instance (large, circle, blue)
is classified as follows. Starting with the root node (shape), we follow the
circle branch to the color node. From the color node we take the blue branch
to a NO node indicating that this instance is not an instance of the concept
of a red circle.
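
The classification walk just described is short enough to write out. The nested (feature, branches) encoding of the tree is an assumption made for this sketch, and the square and triangle branches are assumed to lead directly to NO nodes, since only the circle path is spelled out in the text.

```python
# Decision tree for "red circle": an internal node is (feature, branches),
# a leaf is the string "YES" or "NO". The encoding is hypothetical.
RED_CIRCLE = ("shape", {
    "circle": ("color", {"red": "YES", "blue": "NO"}),
    "square": "NO",
    "triangle": "NO",
})


def classify(tree, instance):
    """Follow the branch for each tested feature until a leaf is reached."""
    while not isinstance(tree, str):
        feature, branches = tree
        tree = branches[instance[feature]]
    return tree


instance = {"size": "large", "shape": "circle", "color": "blue"}
print(classify(RED_CIRCLE, instance))
```

Note that the size feature is never tested, which is how the tree expresses "of any size".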

Decision trees are inherently disjunctive, since each branch leaving a
decision node corresponds to a separate disjunctive case. The tree in
Figure D3b-1, for example, is equivalent to the predicate calculus expression:

¬SHAPE(x, triangle) ∨ ¬SHAPE(x, square) ∨
SHAPE(x, circle) ∧ [COLOR(x, red) ∨ ¬COLOR(x, blue)].

Figure D3b-1. Decision tree for the concept of a red circle.

Consequently, decision trees can be used to represent disjunctive concepts
such as large circle or small square (see Fig. D3b-2).

A drawback of decision trees is that there are many possible trees
corresponding to any single concept. This lack of a unique concept representation
makes  it  difficult  to  check  that  two  decision  trees  are  equivalent. 

The CLS Learning Algorithm (as Used in ID3)

The CLS algorithm starts with an empty decision tree and gradually
refines it, by adding decision nodes, until the tree correctly classifies all
of the training instances. The algorithm operates over a set of training
instances, C, as follows:

Step 1. If all instances in C are positive, then create a YES node and halt.
If all instances in C are negative, create a NO node and halt.
Otherwise, select (using some heuristic criterion) a feature, F, with
values v1, ..., vn and create a decision node for F with one branch for
each value v1, ..., vn.

Step 2. Partition the training instances in C into subsets C1, C2, ..., Cn
according to the values of F.

Step 3. Apply the algorithm recursively to each of the sets Ci.


Figure D3b-2. Decision tree for a disjunctive concept.

The criterion used in step 1 by ID3 is to choose the feature that best
discriminates between positive and negative instances. Hunt et al. (1966) describe
several methods for estimating which feature is the most discriminatory.
Quinlan chooses the feature that leads to the greatest reduction in the
estimated entropy of information of the training instances in C. The exact
criterion is to choose the feature F (with values v_1, v_2, ..., v_n) that minimizes

$$\sum_{i=1}^{n} \left[ -V_i^{+} \log_2 \frac{V_i^{+}}{V_i^{+} + V_i^{-}} - V_i^{-} \log_2 \frac{V_i^{-}}{V_i^{+} + V_i^{-}} \right],$$

where V_i^+ is the number of positive instances in C with F = v_i, and V_i^- is
the number of negative instances in C with F = v_i.
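An entropy-based selection criterion of this kind can be sketched as follows. The sketch assumes the usual expected-information form (for each value of the feature, the positive and negative counts are weighted by the negative log of their proportions) and takes 0 log 0 as 0; the function names and the (features, label) data layout are hypothetical.

```python
from math import log2

def split_cost(instances, feature):
    """Sum, over the values of `feature`, of -V+ log2(V+/(V+ + V-)) - V- log2(V-/(V+ + V-))."""
    counts = {}  # feature value -> [number of positives, number of negatives]
    for features, is_positive in instances:
        pair = counts.setdefault(features[feature], [0, 0])
        pair[0 if is_positive else 1] += 1
    cost = 0.0
    for pos, neg in counts.values():
        for k in (pos, neg):
            if k:  # a zero count contributes nothing (0 log 0 taken as 0)
                cost -= k * log2(k / (pos + neg))
    return cost

def best_feature(instances, features):
    """Pick the feature whose split leaves the least estimated entropy."""
    return min(features, key=lambda f: split_cost(instances, f))

data = [
    ({"shape": "circle", "color": "red"}, True),
    ({"shape": "circle", "color": "blue"}, False),
    ({"shape": "square", "color": "red"}, False),
    ({"shape": "triangle", "color": "red"}, False),
]
print(best_feature(data, ["color", "shape"]))  # shape
```

Splitting on shape isolates the squares and triangles as pure negative subsets, so only the two circles remain mixed; splitting on color leaves three mixed red instances.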

This  CLS  algorithm  can  be  viewed  as  a  refinement-operator  algorithm 
with  only  one  operator: 

Specialize the current hypothesis by adding a new condition (a new
decision node).

The  CLS  algorithm  repeatedly  examines  the  data  during  step  1  to  decide 
which  new  condition  should  be  added.  The  final  decision  tree  developed  by 
CLS  is  a  generalization  of  the  training  instances,  because  in  most  cases  not 
all  features  present  in  the  training  instances  need  to  be  tested  in  the  tree. 
Thus,  CLS  begins  with  a  very  general  hypothesis  and  gradually  specializes  it, 
by  adding  conditions,  until  a  consistent  tree  is  found. 
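The three steps of CLS translate naturally into a short recursive function. The following is a minimal sketch, not Hunt's or Quinlan's code: the tree encoding, the `choose` parameter (standing in for the heuristic of step 1), and the error handling for inconsistent data are all assumptions made for the example.

```python
def cls(instances, features, choose):
    """Build a decision tree from (feature_dict, is_positive) pairs.

    Leaves are "YES" or "NO"; internal nodes are (feature, {value: subtree}).
    `choose(instances, features)` plays the role of the step-1 heuristic.
    """
    if not instances:
        raise ValueError("no training instances")
    labels = {is_positive for _, is_positive in instances}
    if labels == {True}:
        return "YES"
    if labels == {False}:
        return "NO"
    if not features:
        raise ValueError("instances are inconsistent: no feature separates them")
    feature = choose(instances, features)               # step 1
    subsets = {}                                        # step 2
    for inst in instances:
        subsets.setdefault(inst[0][feature], []).append(inst)
    rest = [f for f in features if f != feature]
    return (feature, {value: cls(subset, rest, choose)  # step 3
                      for value, subset in subsets.items()})

def first_feature(instances, features):
    """Trivial stand-in heuristic: always test the first remaining feature."""
    return features[0]

training = [
    ({"shape": "circle", "color": "red"}, True),
    ({"shape": "circle", "color": "blue"}, False),
    ({"shape": "square", "color": "red"}, False),
]
print(cls(training, ["shape", "color"], first_feature))
```

The square branch becomes a NO leaf without ever testing color, which is the sense in which the finished tree generalizes the training instances.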


The ID3 Learning Algorithm

The CLS algorithm requires that all of the training instances be available
on a random-access basis during step 1. This places a practical limit on the size
of the learning problems that it can solve. The ID3 algorithm (Quinlan, 1979,
in press) is an extension to CLS designed to solve extremely large
concept-learning problems. It uses an active experiment-planning approach to select
a good subset of the training instances and requires only sequential access to
the whole set of training instances. Here is an outline of the ID3 algorithm:

Step 1. Select a random subset of size W of the whole set of training
instances (W is called the window size, and the subset is called the
window).

Step  2.  Use  the  CLS  algorithm  to  form  a  rule  to  explain  the  current  window. 

Step 3. Scan through all of the training instances serially to find exceptions
to the current rule.

Step  4.  Form  a  new  window  by  combining  some  of  the  training  instances 
from  the  current  window  with  some  of  the  exceptions  obtained  in 
step  3. 

Repeat steps 2 through 4 until there are no exceptions to the rule.

Quinlan has experimented with two different strategies for building the
new window in step 4. One strategy is to retain all of the instances from the
old window and add a user-specified number of the exceptions obtained from
step 3. This gradually expands the window. The second strategy is to retain
one training instance corresponding to each leaf node in the current decision
tree. The remaining training instances are discarded from the window and
replaced by exceptions. Both methods work quite well, although the second
method may not converge if the concept is so complex that it cannot be
discovered with any window of fixed size W.
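The windowing loop, using the first strategy (keep the old window and add a bounded number of exceptions), can be sketched as below. This is an illustration under stated assumptions: `build` is a compact stand-in for CLS that splits on features in the listed order, the training data are assumed consistent, and the parameter names `window_size` and `max_new` are hypothetical.

```python
import random

def build(instances, features):
    """Compact CLS stand-in: split on the first listed feature (no heuristic)."""
    labels = {label for _, label in instances}
    if len(labels) == 1:
        return labels.pop()
    feature, rest = features[0], features[1:]
    subsets = {}
    for inst in instances:
        subsets.setdefault(inst[0][feature], []).append(inst)
    return (feature, {v: build(s, rest) for v, s in subsets.items()})

def classify(tree, x):
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches.get(x[feature])   # None if this value was never seen
    return tree

def id3_window(instances, features, window_size, max_new, seed=0):
    rng = random.Random(seed)
    window = rng.sample(instances, window_size)            # step 1
    while True:
        tree = build(window, features)                     # step 2
        exceptions = [inst for inst in instances           # step 3
                      if classify(tree, inst[0]) != inst[1]]
        if not exceptions:
            return tree
        window = window + exceptions[:max_new]             # step 4

# Toy data: the concept "red circle" over three features.
instances = [({"shape": s, "color": c, "size": z}, s == "circle" and c == "red")
             for s in ("circle", "square", "triangle")
             for c in ("red", "blue")
             for z in ("large", "small")]
tree = id3_window(instances, ["shape", "color", "size"], window_size=3, max_new=2)
print(all(classify(tree, x) == label for x, label in instances))  # True
```

Because the window only grows, the loop must terminate at the latest when the window contains every instance; the point of the method is that it usually stops much earlier.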


Application of the ID3 Algorithm

The ID3 algorithm has been applied to the problem of learning
classification rules for part of a chess end-game in which the only pieces remaining
are a white king and rook and a black king and knight. ID3 has discovered
rules to describe the concept of “knight's side lost (in at most) n moves” for
n = 2 and n = 3. Table D3b-1 shows the results of these processes.

The features describing the board positions have been chosen to capture
patterns believed to be relevant to the concept of lost in n moves. The actual
raw data for the lost in 2 moves concept comprise 1.8 million distinct board
positions. By choosing appropriate features, Quinlan was able to compress
these into 428 distinct feature vectors. This is an excellent example of the
importance to concept learning of good representation and of knowledge-based
interpretation of the raw data. Quinlan (in press) points out that an important
task for future learning research is to develop a program that can discover a
good set of features.

Strengths and Weaknesses of CLS and ID3

The ID3 and CLS programs, with their very simple representations and
straightforward learning algorithms, perform impressively on the single-concept
learning problem. Much of the power of the ID3 algorithm derives from its
sophisticated selection of training instances. This form of instance selection
has been termed expectation-based filtering by Lenat, Hayes-Roth, and Klahr
(1979). The basic value of expectation-based filtering is that it focuses the
attention of the program on those training instances that violate its
expectations. These are precisely the training instances needed to improve the
program’s representation of the concept being learned. Even this simple form
of experiment planning allows ID3 to solve large learning problems efficiently.

TABLE D3b-1

The Application of ID3 to a Chess End-game

Concept          Number of            Number of   Size of         Solution
                 training instances   features    decision tree   time

Lost in 2 moves  30,000               25          334 nodes       144 seconds*
Lost in 2 moves  428                  23          83 nodes        3 seconds†
Lost in 3 moves  715                  39          177 nodes       34 seconds†

*Using PASCAL implementation on a DEC KL-10.
†Using PASCAL implementation on a CDC CYBER 72.

One of the chief difficulties of the CLS/ID3 method is that the
representation for learned concepts is a decision tree, and decision trees are difficult
to check for equivalence. What is more important, it is difficult for people to
understand the learned concept when it is expressed as a large decision tree.

References 

The  best  discussion  of  BACON  is  Langley  (1980).  The  ID3  algorithm  is 
well  described  in  Quinlan  (in  press). 


D3c.  Concept  Learning  by  Generating  and 
Testing  Plausible  Hypotheses 


THE two model-driven approaches discussed in Article XIV.D1 on issues
(generate-and-test and schema instantiation) have received little attention
from people doing learning research. This article describes one method,
developed by Dietterich and Michalski, that discovers a single concept from
examples by model-driven generate and test. In spite of using only a very
simple model, this method exhibits the strengths and weaknesses that are
typical of model-driven methods: It is quite immune to noise but cannot
incrementally modify its concept description as new training instances become
available.

The INDUCE 1.2 Algorithm

Dietterich and Michalski (1981) address the problem of learning a single
concept from positive training instances only. Their program, INDUCE 1.2,
is intended to be applied in structural-learning situations, that is, situations
in which each training instance has some internal structure. Winston’s
toy-block constructions, for example, are structural training instances; a toy-block
construction is represented as a set of nodes connected by structural relations
like ONTOP, TOUCH, and SUPPORTS (see Article XIV.D3a). Dietterich and
Michalski's model, which guides the search for generalizations, expects the
learned concept to be a conjunction involving both structural relations and
ordinary features.

INDUCE 1.2 seeks to find a few concepts in the rule space, each of which
covers all of the training instances while remaining as specific as possible.
This learning problem is similar to the problem of finding the S set in the
candidate-elimination algorithm. INDUCE 1.2, however, applies some
model-based heuristics to drastically prune the S set so that only a few
generalizations are discovered.

The program assumes that the training instances have been transformed
so that they can be viewed as very specific points in the rule space (i.e., it uses
the single-representation trick). A random sample of the training instances
is chosen. These points in rule space serve as the starting points for a beam
search upward through the rule space, that is, from the very specific
training instances toward more general concepts. The concept descriptions are
generalized by dropping conjunctive conditions and adding internal
disjunctive options until they cover all of the training instances. By starting at the
most specific points in the rule space and stopping as soon as it finds concepts
that cover all of the training instances, INDUCE 1.2 is guaranteed to find the
most specific concepts that cover the data.

The beam-search process has the following steps:

Step 1. Initialize. Set H to contain a randomly chosen subset of size W of
the training instances (W is a constant called the beam width).

Step 2. Generate. Generalize each concept in H by dropping single
conditions in all possible ways. This produces all the concept
descriptions that are minimally more general than those in H. These form
the new H.

Step 3. Prune implausible hypotheses. Remove all but W of the concept
descriptions from H. The pruning is based on syntactic
characteristics of the concept description, such as the number of terms and
the user-defined cost of the terms. Another criterion is to maximize
the number of training instances covered by each element of H.

Step 4. Test. Check each concept description in H to see if it covers all of
the training instances. (This information was obtained previously
in step 3.) If any concept does, remove it from H and place it in a
set C of output concepts.

Repeat steps 2, 3, and 4 until C reaches a prespecified size limit or H
becomes empty.
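The four steps above can be sketched for the simplest case, in which a concept is a plain conjunction of feature conditions. This is not INDUCE 1.2 itself: it ignores structural predicates and internal disjunction, and the frozenset representation, the tie-breaking order in the pruning step, and the parameter names are assumptions made for the example.

```python
import random

def covers(concept, instance):
    """A conjunctive concept covers an instance that satisfies all its conditions."""
    return concept <= instance

def beam_search(instances, beam_width, max_concepts, seed=0):
    """instances: list of frozensets of (feature, value) conditions."""
    rng = random.Random(seed)
    hypotheses = set(rng.sample(instances, beam_width))            # step 1
    found = []

    def score(h):
        # More instances covered is better; ties go to fewer conditions.
        return (-sum(covers(h, i) for i in instances), len(h), sorted(h))

    while hypotheses and len(found) < max_concepts:
        # Step 2: drop one condition from each hypothesis in every possible way.
        generalized = {h - {cond} for h in hypotheses for cond in h}
        # Step 3: keep only the beam_width best-scoring descriptions.
        hypotheses = set(sorted(generalized, key=score)[:beam_width])
        # Step 4: move descriptions that cover every instance into the output.
        for h in sorted(hypotheses, key=score):
            if len(found) < max_concepts and all(covers(h, i) for i in instances):
                found.append(h)
                hypotheses.discard(h)
    return found

def make(size, shape, color):
    return frozenset({("size", size), ("shape", shape), ("color", color)})

examples = [make("large", "circle", "red"),
            make("large", "square", "red"),
            make("large", "circle", "blue")]
print(beam_search(examples, beam_width=2, max_concepts=1))
```

With these three positive instances the only condition shared by all of them is size = large, and the search returns that single-condition concept.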

A schematic diagram of the beam-search process is shown in Figure D3c-1.

Figure D3c-1. A schematic diagram of INDUCE 1.2’s beam search. (The search
moves from more specific descriptions toward more general ones; each
description is marked as pruned, not pruned, or placed in C.)

Extensions to the Basic Algorithm

Structural learning problems of the kind INDUCE 1.2 was designed to
attack require binary (and higher order) predicates to represent the desired
concepts. The binary predicates are needed to express relationships among
the parts (e.g., toy blocks) that make up each training instance. In Winston's
arch training instances, for example, binary predicates could be used to
represent the fact that two blocks are touching, TOUCH(a, b), or that one block
is supporting another, SUPPORTS(a, b). Unary predicates and functions are,
of course, still needed as well. Typically, they represent the attributes of
the parts of an instance. In Winston’s arches, for example, unary predicates
could represent the size and shape of each block. The syntactic distinction
between unary and binary predicates thus corresponds to a semantic
distinction between feature values and binary relationships.

Although it is possible to represent structural relationships using only
unary predicates or functions, such a representation is cumbersome and
unnatural. Consequently, this distinction (by which binary and higher order
predicates correspond to structural relationships and unary predicates and
functions correspond to feature values) holds in most structural learning
situations.

Dietterich and Michalski take advantage of this dichotomy to improve
the efficiency of INDUCE 1.2’s rule-space search. Two separate rule spaces
are used. The first rule space, called the structure-only space, is the space of
all concepts expressible using only the binary (and higher order) terms in the
representation language. The training instances are abstracted into this space
(by dropping all unary predicates and functions), and then the
generate-and-test beam search is applied to this abstract rule space.

Once the set, C, of candidate structure-only concepts is obtained, each
concept, c_i, in C is used to define a new rule space, consisting of all concepts
expressible in terms of the attributes of the subobjects (e.g., blocks) referred
to in c_i. This space can be represented with a simple feature-vector
representation. The training instances are transformed into very specific points in
this space, and another beam search is conducted to find a set, C', of plausible
concept descriptions. The descriptions in C' specify the attributes for the
subobjects referred to in c_i. Taken together, one concept in C' combined
with c_i provides a complete concept description.

As an example of this two-space approach, consider the two positive
training instances below:

Instance 1. ∃ u, v : LARGE(u) ∧ CIRCLE(u) ∧
LARGE(v) ∧ CIRCLE(v) ∧ ONTOP(u, v).

Instance 2. ∃ w, x, y : SMALL(w) ∧ CIRCLE(w) ∧
LARGE(x) ∧ SQUARE(x) ∧
LARGE(y) ∧ SQUARE(y) ∧
ONTOP(w, x) ∧ ONTOP(x, y).


When these two training instances are translated into the structure-only rule
space, the following abstract training instances are obtained:

Instance 1'. ∃ u, v : ONTOP(u, v).

Instance 2'. ∃ w, x, y : ONTOP(w, x) ∧ ONTOP(x, y).

The INDUCE 1.2 beam search discovers that C = {ONTOP(u, v)} is the only,
least general, structure-only concept consistent with the training instances.
Now a new attribute-vector rule space is developed with the features of u
and v:

(SIZE(u), SHAPE(u), SIZE(v), SHAPE(v)).

The training instances are translated to obtain:

Instance 1". (large, circle, large, circle).

Instance 2.1". (small, circle, large, square).

Instance 2.2". (large, square, large, square).

Notice that two alternative training instances are obtained from instance 2',
since ONTOP(u, v) can match instance 2 in two possible ways (u bound to w, v
bound to x; or u bound to x, v bound to y). During the beam search, only one
of these two instances, 2.1" and 2.2", need be covered by a concept description
for that description to be consistent.

The second beam search is conducted in this feature-vector space, and the
concepts (large, •, large, •) and (•, circle, large, •) are found to be the least
general concepts that cover all of the training instances (“•” indicates that the
corresponding feature is irrelevant). By combining each of these feature-only
concepts with the structure-only concept ONTOP(u, v), two overall consistent
concept descriptions are obtained:

C_1: ∃ u, v : ONTOP(u, v) ∧ LARGE(u) ∧ LARGE(v),

C_2: ∃ u, v : ONTOP(u, v) ∧ CIRCLE(u) ∧ LARGE(v).

These correspond to the observations that in both instance 1 and instance 2
there is (C_1) “always a large object on top of another large object” and (C_2)
“always a circle on top of a large object.”
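The translation into the feature-vector space, including the two alternative bindings for instance 2, can be sketched as follows. The data layout (a dictionary of objects plus a list of ONTOP pairs) and the use of "*" as the irrelevant-feature marker are assumptions made for the example, not INDUCE 1.2's actual representation.

```python
def feature_vectors(objects, ontop):
    """One (SIZE(u), SHAPE(u), SIZE(v), SHAPE(v)) vector per way of binding
    u and v so that ONTOP(u, v) holds in the instance."""
    return [objects[u] + objects[v] for u, v in ontop]

def covers(concept, alternatives):
    """A concept covers an instance if it matches at least one of the
    alternative vectors; "*" marks an irrelevant feature."""
    return any(all(c in ("*", a) for c, a in zip(concept, vector))
               for vector in alternatives)

instance_1 = ({"u": ("large", "circle"), "v": ("large", "circle")},
              [("u", "v")])
instance_2 = ({"w": ("small", "circle"), "x": ("large", "square"),
               "y": ("large", "square")},
              [("w", "x"), ("x", "y")])

alts_1 = feature_vectors(*instance_1)
alts_2 = feature_vectors(*instance_2)
print(alts_2)

for concept in [("large", "*", "large", "*"), ("*", "circle", "large", "*")]:
    print(concept, covers(concept, alts_1) and covers(concept, alts_2))
```

The first concept covers instance 2 through the binding u = x, v = y, and the second through u = w, v = x, which is why covering one alternative is enough for consistency.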
Strengths and Weaknesses of the INDUCE 1.2 Approach

The basic algorithm suffers from the absence of a strong model to guide
the pruning of descriptions in step 3 and the termination of the search in
step 4. The present syntactic criteria, of minimizing the number of terms in
a proposed concept, minimizing the user-defined cost of the terms, and
maximizing the number of training instances covered, are very weak. Dietterich
and Michalski claim that domain-specific information could easily be applied
at this point to improve the model-based pruning.

A second weakness is that step 2 involves exhaustive enumeration of all
possible single-step generalizations of the hypotheses in H. This can be very
costly in a large rule space. The method of plausible generate and test works
best if the generator can be constrained to generate only plausible hypotheses.
The generator in INDUCE 1.2 relies on a subsequent pruning step, which is
quite costly.

A third weakness of the method is that, because it prunes its search, it is
incomplete (see Dietterich and Michalski, 1981). It does not find all minimally
general concepts in the rule space that cover all of the training instances.

As  with  all  model-driven  methods,  this  approach  does  not  work  well  in 
incremental  learning  situations.  All  of  the  training  instances  must  be  available 
to  the  learning  algorithm  simultaneously. 

The  advantages  of  the  algorithm  are  that  it  is  faster  and  uses  less  memory 
than  the  full  version-space  approach.  As  with  all  model-based  methods, 
INDUCE  1.2  has  good  noise  immunity.  In  particular,  if  INDUCE  1.2  is  to  be 
given  noisy  training  instances,  then  step  4  can  be  modified  to  include  in  C 
the  concepts  that  cover  most,  rather  than  all,  of  the  training  instances. 

References

Dietterich and Michalski (1981) describe INDUCE 1.2.


D3d. Schema Instantiation


SCHEMA-INSTANTIATION techniques have been used in many AI systems
that perform comprehension tasks such as image interpretation,
natural-language understanding, and speech understanding. Few learning systems
have employed schema-instantiation methods, however. These methods are
useful when a system has a substantial number of constraints that can be
grouped together to form a schema, an abstract skeletal rule. The search of
the rule space can then be guided to only those portions of the space that fit
one of the available schemas. In this section, we describe one learning system,
SPARC, that uses schema instantiation to discover single concepts.

Discovering Rules in Eleusis with SPARC

Dietterich's (1979) SPARC system attempts to solve a learning problem
that arises in the card game Eleusis. Eleusis (developed by Robert Abbott,
1977; see also Gardner, 1977) is a card game in which players attempt to
discover a secret rule invented by the dealer. The secret rule describes a linear
sequence of cards. In their turns, the players attempt to extend this sequence
by playing additional cards from their hands. The dealer gives no information
aside from indicating whether or not each play is consistent with the secret
rule. Players are penalized for incorrect plays by having cards added to their
hands. The game ends when a player empties his hand.

A record of the play is maintained as a layout (see Fig. D3d-1) in which the
top row, or main line, contains all of the correctly played cards in sequence.
Incorrect cards are placed in side lines below the main-line card that they
follow. In the layout shown in Figure D3d-1, the first card correctly played
was the 3 of hearts (3H). This was followed by another correct play, the 9 of
spades (9S). Following the 9, two incorrect plays were made (JD and 5D) before
the next correct card (4C) was played successfully.


Main line:   3H  9S  4C  9D  2C  10D  8H  7H  2C  5H

Side lines:  JD  AH  AS  10H  5D  8H  10S  9D

If the last card is odd, play black; if the last card is even, play red.


Figure D3d-1. An Eleusis layout and the corresponding
secret rule.
The scoring in Eleusis encourages the dealer to choose rules of
intermediate difficulty. The dealer's score is determined by the difference between
the highest and lowest scores of the players. Thus, a good rule is one that is
easy for some players and hard for others.

Schemas in Eleusis

In ordinary play of Eleusis, certain classes of rules have been observed.
Dietterich has identified three rule classes and developed a parameterized
schema for each:

1. Periodic rules. A periodic rule describes the layout as a sequence of
repeating features. For example, the rule Play alternating red and black
cards is a periodic rule. Dietterich’s rule schema for this class can be
described as an N-tuple of conjunctive descriptions:

(C_1, C_2, ..., C_N).

The parameter N is the length of the period (the number of cards before
the period starts to repeat). The above-mentioned periodic rule would
be represented as a 2-tuple:

(RED(card_i), BLACK(card_i)).

More complex periodic rules may refer to the previous periods. Thus,
the rule

(RANK(card_i) > RANK(card_{i-1}), RANK(card_i) < RANK(card_{i-1}))

describes a layout composed of alternating ascending and descending
sequences of cards.

2. Decomposition rules. A decomposition rule describes the layout by a
set of if-then rules. For example, the rule If the last card is odd, play black;
if the last card is even, play red is a decomposition rule. The rule schema
for this class requires that the set of if-then rules have single conjunctions
for the if and then parts of each rule. The if parts must be mutually
exclusive, and they must span all possibilities. The above-mentioned rule
can be written as:

ODD(card_{i-1}) ⇒ BLACK(card_i) ∨
EVEN(card_{i-1}) ⇒ RED(card_i).

3. Disjunctive rules. The third class of rules includes any rules that can
be represented by a single disjunction of conjunctions (i.e., an expression
in disjunctive normal form, or DNF). For example, the rule Play a card
of the same rank or the same suit as the preceding card is a DNF rule. This
is represented as:

RANK(card_i) = RANK(card_{i-1}) ∨ SUIT(card_i) = SUIT(card_{i-1}).

Each schema has a few parameters that control its application. The N
(length of period) parameter of the periodic schema has already been described.
Each schema also has a parameter L, called the lookback parameter, that
indicates how many cards back into the past the rule may consider. Thus,
when L = 0, no preceding cards are examined. When L = 1, the features of
the current card are compared with the previous card, and expressions such
as RANK(card_i) > RANK(card_{i-1}) are permitted. Larger values of L provide
for even further lookback.
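The three example rules can be written as small predicates to show how the period N and the lookback L enter. This is only a sketch: cards are modeled as hypothetical (rank, suit) tuples, and the function names are invented for the illustration rather than taken from SPARC.

```python
RED_SUITS = {"hearts", "diamonds"}

def color(card):
    return "red" if card[1] in RED_SUITS else "black"

def periodic(sequence):
    """Periodic schema with N = 2, L = 0: (RED(card_i), BLACK(card_i))."""
    return all(color(c) == ("red", "black")[i % 2] for i, c in enumerate(sequence))

def decomposition(prev, card):
    """Decomposition schema with L = 1: odd => black, even => red."""
    return color(card) == ("black" if prev[0] % 2 else "red")

def disjunctive(prev, card):
    """DNF schema with L = 1: same rank or same suit as the preceding card."""
    return card[0] == prev[0] or card[1] == prev[1]

def follows(rule, sequence):
    """Check a lookback-1 rule against every adjacent pair of cards."""
    return all(rule(p, c) for p, c in zip(sequence, sequence[1:]))

line = [(3, "hearts"), (9, "spades"), (4, "clubs")]
print(follows(decomposition, line))   # True
print(follows(disjunctive, line))     # False
print(periodic([(3, "hearts"), (9, "spades"), (2, "diamonds")]))  # True
```

With L = 0 a rule sees only the current card (and its position within the period); with L = 1 it also receives the previous card, as in `decomposition` and `disjunctive`.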

Searching the Rule Space Using Schemas

Each schema can be viewed as having its own rule space: the set of all
rules that can be obtained by instantiating that schema. SPARC uses the
single-representation trick to reformulate the layout as a set of very specific
rules for each of the schema-specific rule spaces. The overall algorithm works
as follows:

Step 1. Parameterize a schema. SPARC chooses a schema and selects
particular values for the parameters of that schema.

Step 2. Interpret the training instances. Transform the training instances
(i.e., the cards in the layout) into very specific rules that fit the
chosen schema.

Step 3. Instantiate the schema. Generalize the transformed training instances
to fit the schema. SPARC uses a schema-specific algorithm to
accomplish this step.

Step 4. Evaluate the instantiated schema. Determine how well the schema fits
the data. Poorly fitting rules are discarded.

SPARC conducts a depth-first search of the space of all parameterizations
of all schemas up to a user-specified limit on the magnitudes of the parameters.
Notice that a separate interpretation step is required for each parameterized
schema.

When these steps are applied to the game shown in Figure D3d-1, for
example, step 1 eventually chooses the decomposition schema with L = 1.
Step 2 then converts the training instances into very specific rules in the
corresponding rule space. In this case, the first five cards produce the training
instances shown below. The instances are represented by the feature
vector (RANK, SUIT, COLOR, PARITY) to describe each card. (SPARC actually
generates 24 features to describe each training instance.)

Instance 1 (positive). (3, hearts, red, odd) ⇒ (9, spades, black, odd).

Instance 2 (negative). (9, spades, black, odd) ⇒ (jack, diamonds, red, odd).

Instance 3 (negative). (9, spades, black, odd) ⇒ (5, diamonds, red, odd).

Instance 4 (positive). (9, spades, black, odd) ⇒ (4, clubs, black, even).


Step 3 produces the following instantiated schema (with irrelevant features
indicated by *):

(*, *, *, odd) ⇒ (*, *, black, *) ∨ (*, *, *, even) ⇒ (*, *, red, *).

Step 4 determines that this rule is entirely consistent with the training
instances and is syntactically simple. Consequently, the rule is accepted as a
hypothesis for the dealer’s secret rule.
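The interpretation and evaluation steps for this example can be sketched as follows, using the first five plays of the layout. The sketch is illustrative only: the `plays` encoding, the function names, and the treatment of the jack as rank 11 (which agrees with its being listed as odd) are assumptions, and only four of SPARC's 24 features are modeled.

```python
def describe(rank, suit):
    """Build the (RANK, SUIT, COLOR, PARITY) vector for one card."""
    color = "red" if suit in ("hearts", "diamonds") else "black"
    parity = "odd" if rank % 2 else "even"
    return (rank, suit, color, parity)

def interpret(plays):
    """Step 2 with L = 1: pair each play with the most recent accepted card.

    plays: list of ((rank, suit), accepted) in the order played.
    Returns (previous_card, played_card, accepted) triples.
    """
    instances, last = [], None
    for (rank, suit), accepted in plays:
        card = describe(rank, suit)
        if last is not None:
            instances.append((last, card, accepted))
        if accepted:
            last = card
    return instances

def rule(prev, card):
    """The instantiated schema: odd => black, even => red."""
    return card[2] == ("black" if prev[3] == "odd" else "red")

plays = [((3, "hearts"), True), ((9, "spades"), True),
         ((11, "diamonds"), False), ((5, "diamonds"), False),
         ((4, "clubs"), True)]
instances = interpret(plays)

# Step 4: the rule should accept the positive instances and reject the negative ones.
print(all(rule(prev, card) == accepted for prev, card, accepted in instances))  # True
```

Rejected cards never become the "previous" card, which is why instances 2 through 4 all have the 9 of spades on their left-hand side.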

The schema-instantiation method works well when step 3, the
schema-instantiation step, is easy to accomplish. A good schema provides many
constraints that limit the size of its rule space. In SPARC, for example, the
periodic and decomposition schemas require that their rules be made up of
single conjuncts only. This is a strong constraint that can be incorporated into
the model-fitting algorithm. On the other hand, the DNF schema provides
few constraints and, consequently, an efficient instantiation algorithm could
not be written. The general-purpose A* algorithm (see Article XIV.D3*) was
used instead.

Strengths and Weaknesses of SPARC

The schema-instantiation method used in SPARC was able to find plausible
Eleusis rules very quickly. This is the primary advantage of the
schema-instantiation approach: large rule spaces can be searched quickly. A second
advantage of this approach is that it has good noise immunity. The
schema-instantiation process has access to the full set of training instances, and, thus,
it can use statistical measures to guide the search of rule space.

There are three important disadvantages of the schema-instantiation
method as used in SPARC. First, it is difficult to isolate a group of
constraints and combine them to form a schema. The three schemas in SPARC,
although they cover most “secret rules” pretty well, are known to miss some
important rules. The task of coming up with new schemas, however, is
particularly difficult. A second problem with the schema-instantiation approach
is that special schema-instantiation algorithms must be developed for each
schema. This makes it difficult to apply the approach in new domains. The
third disadvantage is that separate interpretation methods need to be
developed for each schema. This was less of a problem in the Eleusis domain,
because the interpretation processes for the different schemas were very similar.

References

Dietterich (1979) is the original description of the SPARC program.
Dietterich (1980) is a more accessible source. See also Dietterich and Michalski
(in press).

A FEW AI learning systems have been developed that discover a set of
concepts from training instances. These systems perform tasks, such as disease
diagnosis and mass-spectrometer simulation, for which a single concept or
classification rule is not sufficient.

To understand the problems of learning multiple concepts, it is helpful
to review single-concept learning. In single-concept learning (see Sec. XIV.D3),
the learning element is presented with positive and negative instances of some
concept, and it must find a concept description that effectively partitions the
space of all instances into two regions: positive and negative. All instances in
the positive region are believed by the learning system to be examples of the
single concept (see Fig. D4-1).

In multiple-concept learning, the situation is slightly more complicated.
The learning element is presented with training instances that are instances
of several concepts, and it must find several concept descriptions. For each
concept description, there is a corresponding region in the instance space (see
Fig. D4-2). An important multiple-concept learning problem is the problem
of discovering disease-diagnosis rules from training instances. The learning
element is presented with training instances that each contain a description
of a patient's symptoms and the proper diagnosis as determined by a doctor.
The program must discover a set of rules of the form:

(description of symptoms for disease A) ⇒ Disease is A,

(description of symptoms for disease B) ⇒ Disease is B,

...

(description of symptoms for disease N) ⇒ Disease is N.


Figure D4-1. A single concept viewed as a region
of the instance space.

Figure D4-2. Regions of the instance space
corresponding to different rules.

The  left-hand  side  of  each  rule  is  a  concept  description  that  corresponds  to 
a  region  in  the  instance  space  of  all  possible  symptoms  (see  Fig.  D4-2).  Any 
patient  whose  symptoms  fall  in  region  A,  for  example,  will  be  diagnosed  as 
having  disease  A. 

An  important  issue  arising  in  multiple-concept  learning  is  the  problem 
of  overlapping  concept  descriptions — that  is,  overlapping  left-hand  sides  of 
diagnosis  rules.  In  Figure  D4-2,  for  example,  when  a  patient’s  symptoms  fall 
in  the  area  where  regions  A  and  B  overlap,  the  system  will  diagnose  the  patient 
as  having  both  diseases  A  and  B.  This  overlap  may  be  correct,  since  there 
are  often  cases  in  which  a  patient  has  more  than  one  disease  simultaneously. 
On  the  other  hand,  it  is  often  the  case  in  multiple-concept  problems  that 
the various classes are intended to be mutually exclusive. For example, if,
instead  of  diagnosing  diseases,  the  performance  task  is  to  classify  images  of 
handwritten  characters,  it  is  important  that  the  system  arrive  at  a  unique 
classification  for  each  character. 
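
The behavior of such a rule set can be sketched in a few lines. This is only an illustration: it assumes each left-hand side is a Python predicate over a set of symptom names, and the rules and symptom names are hypothetical, not taken from any system described here.

```python
def diagnose(symptoms, rules):
    """Return every class whose left-hand side covers the symptoms."""
    return [disease for disease, covers in rules.items() if covers(symptoms)]


def classify_exclusive(symptoms, rules):
    """For mutually exclusive classes, answer only when exactly one rule fires."""
    matches = diagnose(symptoms, rules)
    return matches[0] if len(matches) == 1 else None


# Hypothetical rules: each left-hand side is a predicate over a symptom set.
rules = {
    "A": lambda s: "fever" in s,
    "B": lambda s: "rash" in s,
    "N": lambda s: "cough" in s and "fever" not in s,
}

print(diagnose({"fever", "rash"}, rules))            # both A and B fire (overlap)
print(classify_exclusive({"fever", "rash"}, rules))  # no unique class
```

When overlap is acceptable (several diseases at once), `diagnose` is the natural interface; when classes must be exclusive, something like `classify_exclusive` makes the ambiguity visible instead of hiding it.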

The  problem  of  overlap  among  multiple  concepts  can  lead  to  integration 
problems,  as  described  in  Article  XIV. A.  When  a  new  rule  or  concept  is  added 
to  the  knowledge  base  in  a  multiple-concept  system,  it  may  be  necessary  to 
modify  the  left-hand  sides  of  existing  rules,  particularly  if  the  concept  classes 
are  intended  to  be  mutually  exclusive. 

The systems described in this section differ from those described in
Section XIV.D5 on multiple-step tasks in that the performance tasks discussed
here can all be accomplished in a single step. The various
disease-classification rules, for example, can be applied simultaneously to classify a
patient's symptoms. Tasks for which this is not the case, such as playing checkers
or solving symbolic integration problems, are discussed in Section XIV.D5.

We first discuss the work of Michalski and his colleagues on the AQ11
program, which learns a set of classification rules for the diagnosis of soybean
diseases. Second, we describe the Meta-DENDRAL system, which learns a set
of  cleavage  rules  that  describe  the  operation  of  a  chemical  instrument  called 
the  mass  spectrometer.  Finally,  the  AM  system,  which  discovers  new  concepts 
in  mathematics,  is  discussed  in  some  detail.  Since  these  systems  do  not  all 
address  the  same  learning  problem,  we  begin  each  article  with  a  description  of 
the  particular  learning  problem  being  attacked  and  then  discuss  the  methods 
employed  to  accomplish  the  learning. 


D4a.  AQ11 

MICHALSKI and his colleagues (Michalski and Larson, 1978; Michalski and
Chilausky, 1980) have developed several techniques for learning a set of
classification rules. The performance element that applies these rules is a pattern
classifier that takes an unknown pattern and classifies it into one of n classes
(see Fig. D4a-1). Many performance tasks, such as optical character recognition
and disease diagnosis, have this form.

The  classification  rules  are  learned  from  training  instances  consisting  of 
sample  patterns  and  their  correct  classifications.  For  the  classifier  to  be  as 
efficient  as  possible,  the  classification  rules  should  test  as  few  features  of  the 
input  pattern  as  necessary  to  classify  it  reliably.  This  is  particularly  relevant  in 
areas like medicine, where the measurement of each additional feature of the
input pattern may be very costly and dangerous. Consequently, Michalski's
learning program AQ11 (Michalski and Larson, 1978) seeks to find the most
general rule in the rule space that discriminates training instances in class c_i
from all training instances in all other classes c_j (i ≠ j). Dietterich and
Michalski (1981) call these discriminant descriptions or discrimination rules,
since  their  purpose  is  to  discriminate  one  class  from  a  predetermined  set  of 
other  classes. 

Figure D4a-1. The n-category classification task.

Using the A^q Algorithm to Find Discrimination Rules

The representation language used by Michalski to represent discrimination
rules is VL1, an extension of the propositional calculus. VL1 is a fairly rich
language that includes conjunction, disjunction, and set-membership operators.
Consequently, the rule space of all possible VL1 discrimination rules is
quite large. To search this rule space, AQ11 uses the A^q algorithm, which
is nearly equivalent to the repeated application of the candidate-elimination
algorithm (see Article XIV.D3a). AQ11 converts the problem of learning
discrimination rules into a series of single-concept learning problems. To find a
rule for class c_i, it considers all of the known instances in class c_i as positive
instances and all other training instances in all of the remaining classes as
negative instances. The A^q algorithm is then applied to find a description
that covers all of the positive instances without covering any of the negative
instances. AQ11 seeks the most general such description, which corresponds
to a necessary condition for class membership. Figure D4a-2 shows schematically
how this works. The dots represent known training instances, and the
circle represents the set of possible training instances that are covered by the
description of class c_1.

For each class c_i, such a “concept” is discovered. The result is shown
schematically in Figure D4a-3.
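
The conversion into a series of single-concept problems can be sketched as follows. This shows only the positive/negative split for each class, not the A^q search itself; the function name and the toy feature vectors are assumptions made for illustration.

```python
def one_vs_rest_problems(training_set):
    """Yield (class_label, positives, negatives) for every class in the data."""
    classes = sorted({label for _, label in training_set})
    for target in classes:
        positives = [x for x, label in training_set if label == target]
        negatives = [x for x, label in training_set if label != target]
        yield target, positives, negatives


# Toy training instances: (feature vector, class label).
training_set = [((1, 0), "c1"), ((1, 1), "c1"), ((0, 1), "c2"), ((0, 0), "c3")]
problems = {c: (pos, neg) for c, pos, neg in one_vs_rest_problems(training_set)}
print(problems["c1"])
```

Each entry of `problems` is then an ordinary single-concept learning task that could be handed to any single-concept learner.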

Note  that  the  discrimination  rules  may  overlap  in  regions  of  the  instance 
space  that  have  not  yet  been  observed.  This  overlap  is  useful  because  it 
allows  the  performance  element  to  be  somewhat  conservative.  In  the  areas  in 
which  the  discrimination  rules  are  ambiguous  (i.e.,  overlap),  the  performance 
element  can  report  this  to  the  user  rather  than  assign  the  unknown  instance 
to  one  arbitrarily  chosen  class. 

Figure D4a-3. Finding single concepts for each class.

AQ11 also has a method for finding a nonoverlapping set of classification
rules. Since the algorithm uses the single-representation trick, it can accept
not only single points in the instance space (as represented by very specific
points in the rule space) but also generalized “instances” that are conjuncts
in the rule space corresponding to sets of training instances. This allows AQ11
to treat the concept descriptions themselves as negative examples when it is
learning the concept description for a subsequent class. Thus, in order to
obtain a nonoverlapping set of discrimination rules, AQ11 takes as its positive
instances all known instances in c_i and as its negative instances all known
instances in c_j (j ≠ i) plus all conjuncts that make up the discrimination
rules for previously processed classes c_k (k < i). The resulting disjoint rules
are shown schematically in Figure D4a-4 (assuming the classes were processed
in the order c_1, c_2, c_3).

Figure D4a-4. Finding nonoverlapping classification rules.

The rules that are developed split up the unobserved part of the instance
space in such a way that c_1 gets the largest share, c_2 covers any space not
covered by c_1, c_3 covers any space not covered by c_1 or c_2, and so on. The way
in which the space is divided depends on the order in which the classes are
processed. A performance element that uses such a disjoint set of concepts
will be reckless in the sense that it will assign an unknown instance to an
arbitrary class. The classifier arbitrarily prefers c_1 to c_2, c_2 to c_3, and so on.
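
The order dependence can be illustrated with a small sketch. It does not reproduce how AQ11 learns disjoint rules (AQ11 feeds earlier rules back in as negative examples during learning); it only mimics the resulting priority effect at classification time. Rules are assumed to be predicates, and all names are hypothetical.

```python
def make_disjoint(ordered_rules):
    """Give earlier classes priority: a later rule covers x only if no earlier rule does."""
    disjoint, earlier = [], []
    for name, covers in ordered_rules:
        blockers = tuple(earlier)  # freeze the rules processed so far

        def restricted(x, covers=covers, blockers=blockers):
            return covers(x) and not any(b(x) for b in blockers)

        disjoint.append((name, restricted))
        earlier.append(covers)
    return disjoint


def classify(x, rules):
    return [name for name, covers in rules if covers(x)]


c1 = ("c1", lambda x: x < 6)
c2 = ("c2", lambda x: x > 3)

print(classify(5, [c1, c2]))                 # overlapping rules: both classes
print(classify(5, make_disjoint([c1, c2])))  # c1 processed first
print(classify(5, make_disjoint([c2, c1])))  # c2 processed first
```

The same instance receives a different unique class depending only on the processing order, which is the sense in which the resulting classifier is arbitrary.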

The discrimination rules developed by AQ11 correspond (roughly) to the
set of most general descriptions consistent with the training instances, that is, the
G set in the candidate-elimination algorithm (see Sec. XIV.D3). In many
situations, it is also good to develop, for each class c_i, the most specific (S-set)
description  of  that  class.  This  permits  very  explicit  handling  of  the  unobserved 
portions  of  the  space.  Figure  D4a-5  shows  such  a  set  of  descriptions. 

When S and G sets are both available, the performance element can
choose among definite classification (the instance is covered by the S set),
probable classification (the instance is covered by only one G set), and multiple
classification (the instance is covered by several G sets). AQ11 has the ability
to calculate an approximate S set for each class. When the description of the
class is disjunctive, the S set is also disjunctive.
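
A minimal sketch of this three-way decision follows, assuming each class description is given as a pair of predicates (S-set cover, G-set cover). The "unknown" outcome for instances covered by no G set is an added assumption, as are the interval-based toy descriptions.

```python
def classify_with_s_and_g(x, descriptions):
    """descriptions maps class name -> (s_covers, g_covers) predicates."""
    definite = [c for c, (s, _) in descriptions.items() if s(x)]
    if definite:
        return "definite", definite
    general = [c for c, (_, g) in descriptions.items() if g(x)]
    if len(general) == 1:
        return "probable", general
    if general:
        return "multiple", general
    return "unknown", []  # assumption: no G set covers the instance


# Toy one-dimensional descriptions: S is a tight interval, G a loose one.
descriptions = {
    "c1": (lambda x: 0 <= x <= 2, lambda x: x <= 5),
    "c2": (lambda x: 8 <= x <= 10, lambda x: x >= 4),
}

print(classify_with_s_and_g(1, descriptions))
print(classify_with_s_and_g(3, descriptions))
print(classify_with_s_and_g(4.5, descriptions))
```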

Applications of AQ11

The AQ11 program has been applied to the problem of discovering
disease-diagnosis rules for 15 soybean diseases (Michalski and Chilausky, 1980). Here
is  an  example  of  a  classification  rule  for  the  disease  Rhizoctonia  root  rot 
obtained  by  the  overlapping-concept  approach  discussed  above: 

leaves ∈ (normal) ∧ stem ∈ (abnormal) ∧
stem cankers ∈ (below soil line) ∧ canker lesion color ∈ (brown) ∨
leaf malformation ∈ (absent) ∧ stem ∈ (abnormal) ∧
stem cankers ∈ (below soil line) ∧ canker lesion color ∈ (brown)
⇒ Rhizoctonia root rot.

Figure D4a-5. Learning both the G and S set descriptions for each class.
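
A rule of this shape (a disjunction of conjunctions of set-membership tests) can be evaluated with a very small interpreter. The sketch below is not VL1 itself; it assumes a rule is a list of conjuncts, each mapping a feature name to its set of allowed values, and the example plant is hypothetical.

```python
def rule_covers(rule, plant):
    """rule: list (disjunction) of dicts (conjunctions) feature -> allowed values."""
    return any(
        all(plant.get(feature) in allowed for feature, allowed in conjunct.items())
        for conjunct in rule
    )


stem_part = {
    "stem": {"abnormal"},
    "stem cankers": {"below soil line"},
    "canker lesion color": {"brown"},
}
rhizoctonia_root_rot = [
    {"leaves": {"normal"}, **stem_part},
    {"leaf malformation": {"absent"}, **stem_part},
]

# Hypothetical plant description (feature -> observed value).
plant = {
    "leaves": "abnormal",
    "leaf malformation": "absent",
    "stem": "abnormal",
    "stem cankers": "below soil line",
    "canker lesion color": "brown",
}
print(rule_covers(rhizoctonia_root_rot, plant))
```

Only the features named in a conjunct are tested, which reflects the goal of testing as few features as necessary.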
An interesting experiment was conducted as part of the soybean disease
project. The goal was to compare the quality of rules obtained through
consultation with expert plant pathologists with rules developed by learning
from examples. Descriptions of 630 diseased soybean plants were entered into
the computer (as feature vectors involving 35 features) along with an expert's
diagnosis of each plant. A special instance-selection program, ESEL, was used
to select 290 of the sample plants as training instances. ESEL attempts to
select training instances that are quite different from one another, that is, instances
that are “far apart” in the instance space. The remaining 340 instances
were set aside to serve as a testing set for comparing the performance of the
machine-derived rules with the performance of the expert-derived rules.

AQ11 was then run on the 290 training instances to develop overlapping
rules such as the rule above. Simultaneously, the researchers consulted with
the plant pathologist to obtain a set of rules. They adopted the standard
knowledge-engineering approach of interviewing the expert and translating
his expertise into diagnosis rules. The expert insisted on using a description
language that was somewhat more expressive than the language used by AQ11.
The expert's rules, for example, listed some features as necessary and other
features as confirmatory; AQ11 was unable to make such a distinction.

As  a  consequence  of  the  differing  description  languages,  slightly  differing 
performance  elements  had  to  be  developed  to  apply  the  two  sets  of  rules,  and 
each  performance  element  was  adjusted  to  get  the  best  performance  from  its 
classification  rules.  Surprisingly,  the  computer-generated  rules  outperformed 
the  expert-derived  rules.  Despite  the  fact  that  the  expert-derived  rules  were 
expressed in a more powerful language, the machine-generated rules gave the
correct disease top ranking 97.8% of the time, compared to only 71.8% for the
expert-derived rules. Overall, the machine-generated rules listed the correct
disease among the possible diagnoses 100% of the time, in contrast to 98.9%
for the expert's rules. Furthermore, the computer-derived rules tended to
list  fewer  alternative  diagnoses.  The  conclusion  of  the  experiment  was  that 
automatic  rule  induction  can,  in  some  situations,  lead  to  more  reliable  and 
more  precise  diagnosis  rules  than  those  obtained  by  consultation  with  the 
expert. 
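
The three quantities compared in this experiment (how often the correct disease is ranked first, how often it is listed at all, and how many diagnoses are listed) can be computed as sketched below. The function and the toy data are illustrative assumptions and do not reproduce the published evaluation procedure.

```python
def ranking_metrics(ranked_lists, truths):
    """Return (top-rank rate, listed rate, mean number of diagnoses listed)."""
    n = len(truths)
    top = sum(1 for ranked, t in zip(ranked_lists, truths) if ranked[:1] == [t])
    listed = sum(1 for ranked, t in zip(ranked_lists, truths) if t in ranked)
    mean_listed = sum(len(r) for r in ranked_lists) / n
    return top / n, listed / n, mean_listed


# Toy output of a rule set on four test plants, best guess first.
ranked = [["a", "b"], ["b"], ["c", "a"], ["b", "d"]]
truths = ["a", "b", "a", "c"]
print(ranking_metrics(ranked, truths))
```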

References

Michalski and Larson (1978) describe the AQ11 and ESEL programs in
detail.  The  soybean  work  is  described  in  Michalski  and  Chilausky  (1980). 
D4b. Meta-DENDRAL

META-DENDRAL (Buchanan and Mitchell, 1978) is a program that discovers
rules describing the operation of a chemical instrument called a mass
spectrometer. The mass spectrometer is a device that bombards small chemical
samples with accelerated electrons, causing the molecules of the sample to
break apart into many charged fragments. The masses of these fragments can
then be measured to produce a mass spectrum, a histogram of the number
of fragments (also called the intensity) plotted against their mass-to-charge
ratio (see Fig. D4b-1).

An analytic chemist can infer the molecular structure of the sample
chemical through careful inspection of the mass spectrum. The Heuristic
DENDRAL program (see Sec. VII.C2, in Vol. II) is able to perform this task
automatically. It is supplied with the chemical formula (but not the structure)
of the sample and its mass spectrum. Heuristic DENDRAL first examines the
spectrum to obtain a set of constraints. These constraints are then supplied
to CONGEN, a program that can generate all possible chemical structures
satisfying the constraints. Finally, each of these generated structures is tested
by running it through a mass-spectrometer simulator. The simulator applies
a set of cleavage rules to predict which bonds in the proposed structure will
be broken. The result is a simulated mass spectrum for each candidate
structure. The simulated spectra are compared with the actual spectrum, and
the structure whose simulated spectrum best matches the actual spectrum is
ranked as the most likely structure for the unknown sample.
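
The final test-and-rank stage can be sketched as follows. The text does not say how spectra are compared, so the overlap measure used here (shared peak masses divided by all peak masses) is purely an assumption, as are the toy structures and the lookup-table "simulator".

```python
def rank_candidates(candidates, simulate, observed_masses):
    """Order candidate structures by how well their simulated peaks match."""
    def match(structure):
        predicted = simulate(structure)
        union = predicted | observed_masses
        # Assumed similarity: fraction of peak masses shared by both spectra.
        return len(predicted & observed_masses) / len(union) if union else 1.0

    return sorted(candidates, key=match, reverse=True)


# Toy stand-in for the simulator: a table of predicted peak masses.
toy_spectra = {"s1": {43, 57}, "s2": {43, 71, 85}}
ranking = rank_candidates(["s1", "s2"], toy_spectra.get, {43, 71, 85})
print(ranking)
```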


Figure D4b-1. A mass spectrum.



The  Learning  Problem 

Meta-DENDRAL was designed to serve as the learning element for Heuristic
DENDRAL. (For an alternate view of Meta-DENDRAL as an expert
system, see Article VII.C2c, in Vol. II.) Its purpose is to discover new cleavage
rules for DENDRAL's mass-spectrometer simulator. These rules are grouped
according to structural families. Chemists have noted that molecules that
share the same structural skeleton behave in similar ways inside the mass
spectrometer. Conversely, molecules with vastly different structures behave
in vastly different ways. Thus, no single set of cleavage rules can accurately
describe the behavior of all molecules in the mass spectrometer.

Figure D4b-2 shows an example of a structural skeleton for the family
of monoketoandrostanes. Particular molecules in this family are constructed
by attaching keto groups (=O) to any of the available carbon atoms in the
skeleton. 

The  learning  problem  addressed  by  Meta-DENDRAL  is  to  discover  the 
cleavage  rules  for  a  particular  structural  family.  The  problem  can  be  stated 
as  follows: 

Given: (a) A representation language for describing molecular structures
and substructures; and

(b) A training set of known molecules, chosen from a single structural
family, along with their structures and their mass spectra;

Find: A set of cleavage rules that characterize the behavior of this
structural family in the mass spectrometer.

This learning problem is difficult because it contains two sources of ambiguity.
First, the mass spectra of the training molecules are noise-ridden. There may
be falsely observed fragments (false positives) and important fragments that
may not have been observed (false negatives). Second, the cleavage rules need
not be entirely consistent with the training instances. A rule that correctly
predicts a cleavage in more than half of the molecules can be considered to
be acceptable; the rules need not be cautious. It is safer, from the point of
view of DENDRAL's simulation task, to predict cleavages that do not occur
than it is to fail to predict cleavages that do occur.

Figure D4b-2. The structural skeleton for the monoketoandrostane family.

Meta-DENDRAL's representation language corresponds to the ball-and-stick
models used by chemists. The molecule is represented as an undirected
graph in which nodes denote atoms and edges denote chemical bonds. Hydrogen
atoms are not included in the graph. Each atom can have four features:
(a) the atom type (e.g., carbon, nitrogen), (b) the number of nonhydrogen
neighbors,  (c)  the  number  of  hydrogen  atoms  that  are  bonded  to  the  atom,  and 
(d)  the  number  of  double  bonds  in  which  the  atom  participates.  A  cleavage 
rule  is  expressed  in  terms  of  a  bond  environment — a  portion  of  the  molecular 
structure  surrounding  a  particular  bond.  The  bond  environment  makes  up 
the  condition  part  of  a  cleavage  rule.  The  action  part  of  the  rule  specifies 
that  the  designated  bond  will  cleave  in  the  mass  spectrometer.  Figure  D4b-3 
shows  a  typical  cleavage  rule. 

The  performance  element  (the  simulator)  applies  the  production  rule  by 
matching  the  left-hand-side  bond  environment  to  the  molecular  structure  that 
is  undergoing  simulated  bombardment.  Whenever  the  left-hand-side  pattern 
is matched, the right-hand-side predicts that the bond designated by * will
break. 
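
To make the matching step concrete, here is a simplified sketch. It assumes the bond environment is a linear chain of four atoms x-y-z-w with the marked bond between y and z, and that unspecified features act as wildcards; the real representation is a general graph, and the class, function names, and toy molecule are all hypothetical.

```python
from dataclasses import dataclass, astuple
from typing import Optional


@dataclass(frozen=True)
class AtomFeatures:
    atom_type: Optional[str] = None     # None means "unspecified" in a pattern
    neighbors: Optional[int] = None     # nonhydrogen neighbors
    h_neighbors: Optional[int] = None   # attached hydrogens
    double_bonds: Optional[int] = None

    def matches(self, atom):
        return all(w is None or w == h for w, h in zip(astuple(self), astuple(atom)))


def predicted_breaks(pattern, atoms, bonds):
    """pattern: (x, y, z, w) chain; returns the set of (y, z) bonds predicted to break."""
    x, y, z, w = pattern
    breaks = set()
    for b in atoms:
        for c in bonds[b]:
            if not (y.matches(atoms[b]) and z.matches(atoms[c])):
                continue
            for a in bonds[b]:
                for d in bonds[c]:
                    if len({a, b, c, d}) == 4 and x.matches(atoms[a]) and w.matches(atoms[d]):
                        breaks.add((b, c))
    return breaks


# Toy molecule: the chain C1-C2-N3-C4 (hydrogens omitted from the graph).
atoms = {
    1: AtomFeatures("carbon", 1, 3, 0),
    2: AtomFeatures("carbon", 2, 2, 0),
    3: AtomFeatures("nitrogen", 2, 1, 0),
    4: AtomFeatures("carbon", 1, 3, 0),
}
bonds = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
pattern = (
    AtomFeatures("carbon"),
    AtomFeatures("carbon", 2, 2, 0),
    AtomFeatures("nitrogen", 2, 1, 0),
    AtomFeatures("carbon"),
)
print(predicted_breaks(pattern, atoms, bonds))
```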


The  Interpretation  Problem  and  the  Subprogram  INTSUM 

Meta-DENDRAL employs the method of model-driven generate-and-test
to search the rule space of possible cleavage rules. Before it can carry out
this search, however, it must first interpret the training instances and convert
them into very specific points in the rule space (i.e., into very specific cleavage
rules).


x-y-z-w  ⇒  x-y * z-w

Node  Atom type  Neighbors  H-neighbors  Double bonds
x     carbon     3          1            0
y     carbon     2          2            0
z     nitrogen   2          1            0
w     carbon     2          2            0

Figure D4b-3. A typical cleavage rule.

The interpretation process is accomplished by the subprogram INTSUM
(INTerpretation and SUMmary). Recall that the training instances have the
form:

(whole molecular structure) → (mass spectrum).

INTSUM seeks to develop a set of very specific cleavage rules of the form:

(whole molecular structure) → (one designated broken bond).

To make this conversion, INTSUM must hypothesize which bonds were
broken to produce which peaks in the spectrum. It accomplishes this by means
of a “dumb” version of the DENDRAL mass-spectrometer simulator. Since
Meta-DENDRAL is attempting to discover cleavage rules for this particular
structural class, it cannot use those same cleavage rules to drive the simulation.
Instead, a simple half-order theory of mass spectrometry is adopted.

The  half-order  theory  describes  the  action  of  the  mass  spectrometer  as 
a  sequence  of  complete  fragmentations  of  the  molecule.  One  fragmentation 
slices  the  molecule  into  two  pieces.  A  subsequent  fragmentation  may  further 
split  one  of  those  two  pieces  to  create  two  smaller  pieces,  and  so  on.  After 
each  fragmentation,  some  atoms  from  one  piece  of  the  molecule  may  migrate 
to  the  other  piece  (or  be  lost  altogether).  The  half-order  theory  places  certain 
constraints on this split-and-migrate process. It says that all bonds will break
in  the  molecule  except  the  following: 

1.  Double  and  triple  bonds  do  not  break; 

2.  Bonds  in  aromatic  rings  do  not  break; 

3.  Two  bonds  involving  the  same  atom  do  not  break  simultaneously; 

4.  No  more  than  three  bonds  break  simultaneously; 

5. At most, only two fragmentations occur (one after the other);

6. No more than two rings can be split as the result of both of the
fragmentations.

Constraints  are  also  placed  on  the  kinds  of  migrations  that  can  occur: 

1. No more than two hydrogen atoms migrate after a fragmentation;

2. At most, one H2O is lost;

3. At most, one CO is lost.

The parameters of the theory are flexible and can be adjusted by the user of
Meta-DENDRAL.
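
The first four bond constraints lend themselves to a short enumeration sketch. It covers a single fragmentation only and ignores the limits on successive fragmentations, ring splitting, and migration; the bond encoding (atom, atom, bond order, aromatic flag) and the function name are assumptions.

```python
from itertools import combinations


def allowed_break_sets(bonds, max_bonds=3):
    """bonds: list of (atom_a, atom_b, order, aromatic). Yield sets of bonds that may break together."""
    # Constraints 1 and 2: only single, non-aromatic bonds can break.
    breakable = [b for b in bonds if b[2] == 1 and not b[3]]
    # Constraint 4: at most max_bonds bonds break simultaneously.
    for size in range(1, max_bonds + 1):
        for subset in combinations(breakable, size):
            atoms = [a for bond in subset for a in bond[:2]]
            # Constraint 3: no atom takes part in two broken bonds.
            if len(atoms) == len(set(atoms)):
                yield subset


# Toy chain 1-2-3=4-5 (the 3-4 bond is a double bond).
bonds = [(1, 2, 1, False), (2, 3, 1, False), (3, 4, 2, False), (4, 5, 1, False)]
break_sets = list(allowed_break_sets(bonds))
print(len(break_sets))
```

Making `max_bonds` a parameter mirrors the remark that the parameters of the theory can be adjusted by the user.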

Based on this theory, INTSUM simulates the bombarding and cleaving of
the molecular structures provided in the training instances. The result is a
simulated spectrum in which each simulated peak has an associated record
of the bond cleavages that caused that peak to appear. Each simulated
peak is compared with the actual observed peaks. If their masses match,
then INTSUM infers that the “cause” of the simulated peak is a plausible
explanation of the observed peak. If a simulated peak finds no matching
observed peak, it is ignored. If an observed peak remains unexplained, it is
also ignored. However, unexplained peaks are reported to the chemist. A large
proportion  of  unexplained  peaks  would  indicate  that  the  half-order  theory  was 
inadequate  to  explain  the  operation  of  the  mass  spectrometer  in  this  training 
instance. 
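
The peak-matching step can be sketched as below, assuming integer peak masses compared for exact equality and a simulated spectrum given as (mass, cleavage record) pairs. These representations and the function name are illustrative assumptions.

```python
def interpret(simulated, observed_masses):
    """Return (explanations, unexplained) for one training molecule."""
    explanations = {}
    for mass, cleavage in simulated:
        if mass in observed_masses:  # unmatched simulated peaks are ignored
            explanations.setdefault(mass, []).append(cleavage)
    # Unexplained observed peaks are not used, but are reported.
    unexplained = sorted(set(observed_masses) - set(explanations))
    return explanations, unexplained


simulated = [(43, "break b1"), (57, "break b2"), (43, "break b3"), (99, "break b4")]
explanations, unexplained = interpret(simulated, {43, 57, 71})
print(explanations, unexplained)
```

Note that one observed peak may receive several candidate explanations, which is one way the interpreted training instances become noisy.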

The  half-order  theory  contributes  another  source  of  ambiguity  to  the 
learning  problem.  The  interpreted  set  of  training  instances  can  easily  contain 
erroneous  instances.  INTSUM’s  half-order  theory  tends  to  predict  cleavages 
that  did  not,  in  fact,  occur.  It  is  also  not  unusual  for  the  half-order  theory 
to  fail  to  predict  cleavages  that  did  occur.  Thus,  the  training  instances  that 
guide  the  rule  space  search  are  very  noisy  indeed. 

The  Search  of  the  Rule  Space 

Meta-DENDRAL searches the rule space in two phases. First, a model-driven
generate-and-test search is conducted by the RULEGEN subprogram.
This is a fairly coarse search from which redundant and approximate rules
may result. The second phase of the search is conducted by the RULEMOD
subprogram, which cleans up the rules developed by RULEGEN to make them
more precise and less redundant.

RULEGEN. This subprogram searches the rule space of bond environments
in order from most general to most specific. The algorithm repeatedly
generates a new set of hypotheses, H, and tests it against the (positive) training
instances developed by INTSUM, as follows:

Step 1. Initialize H to contain the most general bond environment. The
most general bond environment matches every bond in a molecule and
thus predicts that every bond will break. Since the most useful
(i.e., most accurate) bond environment lies somewhere between this
overly general environment (x * y) and the overly specific, complete
molecular structure (with specified bonds breaking), the program
generates refined environments by successively specializing the H
set.

Step 2. Generate a new set of hypotheses. Specialize the set H by making
a change to all atoms at a specified distance (radius) from the
* bond (the bond designated to break). The change can involve
either adding new neighbor atoms or specifying an atom feature.
All possible specializations are made for which there is supporting
evidence. The technique of modifying all atoms at a particular
radius causes the RULEGEN search to be coarse.

Step 3. Test the hypotheses against the training instances. The bond
environments in H are examined to determine how much evidence there
is for each environment. An improvement criterion is computed for
each environment that states whether the environment is more
plausible than the parent environment from which it was obtained
by specialization. Environments that are determined to be more
plausible than their parents are retained. The others are pruned
from the H set. If all specializations of a parent environment are
determined to be less plausible than their parent, the parent is
output as a new cleavage rule and is removed from H.

Repeat steps 2 and 3 until H is empty.
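
The control structure of steps 1 to 3 can be sketched generically. The chemistry is abstracted away: `specialize` and `is_improvement` are caller-supplied functions, and the toy stand-ins below (a hypothesis is just a count of specified features) are assumptions chosen only to make the loop runnable.

```python
def rulegen(most_general, specialize, is_improvement):
    """General-to-specific search that keeps children improving on their parent."""
    rules, frontier = [], [most_general]          # step 1
    while frontier:                               # repeat until H is empty
        next_frontier = []
        for parent in frontier:
            children = specialize(parent)         # step 2
            better = [c for c in children if is_improvement(c, parent)]  # step 3
            if better:
                next_frontier.extend(better)
            else:
                rules.append(parent)              # no child improves: output the parent
        frontier = next_frontier
    return rules


# Toy stand-ins: specialization adds one feature; improvement stops after 3.
toy_specialize = lambda n: [n + 1] if n < 5 else []
toy_improves = lambda child, parent: child <= 3

print(rulegen(0, toy_specialize, toy_improves))
```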


Figure D4b-4 shows a portion of the RULEGEN search tree. Horizontal
levels in the tree correspond to the contents of the H set after each iteration.
Starting with the root pattern, S0, the number-of-neighbors attribute
is specialized (i.e., the pattern graph is expanded) for each atom at distance
zero from (adjacent to) the break to give pattern S1. The atom type is then
specified for atoms adjacent to the break in S2 and for atoms one bond
removed from the break in S3. At each step, there are many other possible
successors corresponding to assignments of other values to these same
attributes or to other attributes.

Figure D4b-4. A portion of the RULEGEN search tree.

The improvement criterion used in step 3 states that a daughter environment
graph is more plausible than its parent graph if:

1. It predicts fewer fragmentations per molecule (i.e., it is more specific);

2. It still predicts fragmentations for at least half of all of the molecules
(i.e., it is sufficiently general);

3. It predicts fragmentations for as many molecules as its parent, unless
the parent graph was “too general” in the sense that the parent predicts
more than 2 fragmentations in some single molecule or on the average
it predicts more than 1.5 fragmentations per molecule.
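
One possible reading of this criterion is sketched below. It assumes each environment is summarized by a list giving the number of fragmentations it predicts in each training molecule, and that "fewer fragmentations per molecule" is judged by the total over the same set of molecules; both are assumptions.

```python
def is_improvement(child_counts, parent_counts):
    """Counts are predicted fragmentations per training molecule, in the same order."""
    n = len(parent_counts)

    def covered(counts):
        return sum(1 for c in counts if c > 0)

    more_specific = sum(child_counts) < sum(parent_counts)             # condition 1
    general_enough = 2 * covered(child_counts) >= n                    # condition 2
    parent_too_general = max(parent_counts) > 2 or sum(parent_counts) / n > 1.5
    keeps_coverage = covered(child_counts) >= covered(parent_counts)   # condition 3
    return more_specific and general_enough and (keeps_coverage or parent_too_general)


print(is_improvement([1, 1, 1, 0], [3, 3, 2, 2]))  # parent too general
print(is_improvement([1, 1, 0, 0], [1, 1, 1, 1]))  # loses coverage
```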

This algorithm assumes that the improvement criterion increases monotonically
to a single maximum value (i.e., it is unimodal). This is usually true
for the mass-spectrometry learning task. RULEGEN can thus be viewed as
following  monotonically  increasing  paths  down  through  the  partial  order  of 
the  rule  space  until  the  criterion  attains  a  local  maximum  value. 

RULEMOD. The rules produced by RULEGEN are very approximate and
have not been tested against negative evidence. RULEMOD improves these
rules by conducting fine hill-climbing searches in the portions of the rule space
near the rules located by RULEGEN. The subprogram RULEMOD proceeds
in  four  steps: 

Step 1. Select a subset of important rules. RULEGEN can produce rules that
are different from one another but that explain many of the same
data points. RULEMOD attempts to find a small set of rules that
account for all of the data. Negative evidence is gathered for
each rule by re-invoking the mass-spectrometer simulator. Each
candidate rule is tested to see how many incorrect predictions are
made as well as how many correct predictions. The rules are ranked
according to a scoring function, I × (P + U − 2N), where I is the
average intensity of the positively predicted peaks, P is the number
of correctly predicted peaks, U is the number of correct peaks
predicted uniquely by this rule and no other, and N is the number
of incorrectly predicted peaks. The top-ranked rule is selected.
All evidence peaks explained by that rule are removed, and the
ranking and selection process is repeated until all positive evidence
is explained or until the scores fall below a specified threshold.
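
The ranking-and-selection loop can be sketched as follows. The rule records (a name, the set of correctly predicted peaks, and a count of incorrect predictions) are an assumed representation, and recomputing P and U against the remaining peaks and remaining candidates in each round is one interpretation of the text, not a confirmed detail.

```python
def score(rule, candidates, remaining, intensity):
    """I * (P + U - 2N), computed against the still-unexplained peaks."""
    correct = rule["correct"] & remaining
    if not correct:
        return 0.0
    others = set()
    for other in candidates:
        if other is not rule:
            others |= other["correct"]
    unique = correct - others
    mean_intensity = sum(intensity[p] for p in correct) / len(correct)
    return mean_intensity * (len(correct) + len(unique) - 2 * rule["wrong"])


def select_rules(rules, intensity, threshold=0.0):
    remaining = set()
    for r in rules:
        remaining |= r["correct"]
    candidates, chosen = list(rules), []
    while remaining and candidates:
        best = max(candidates, key=lambda r: score(r, candidates, remaining, intensity))
        if score(best, candidates, remaining, intensity) <= threshold:
            break
        chosen.append(best["name"])
        remaining -= best["correct"]          # remove the explained peaks
        candidates = [r for r in candidates if r is not best]
    return chosen


intensity = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
toy_rules = [
    {"name": "r1", "correct": {"a", "b", "c"}, "wrong": 0},
    {"name": "r2", "correct": {"b", "c"}, "wrong": 1},
    {"name": "r3", "correct": {"d"}, "wrong": 0},
]
print(select_rules(toy_rules, intensity))
```

In the toy data, r2 is redundant with r1 and makes an incorrect prediction, so it is never selected.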

Step 2. Specialize rules to exclude negative evidence. RULEMOD attempts to
specialize the rules in order to exclude some negative evidence while
retaining the positive evidence. For each candidate rule, RULEMOD
attempts to fill in additional values for features that were left
unspecified by RULEGEN. RULEMOD first examines all of the
positive instances predicted by the candidate rule and obtains a list
of all possible feature values that are common to all of the positive
instances. Each of these feature values could individually be added
to the rule without excluding any positive instances. RULEMOD
attempts to select a mutually compatible set of values that will
exclude a large amount of negative evidence.
The selection process uses a hill-climbing search. The feature value
that excludes the largest number of negative instances is chosen
and added to the candidate rule. Incompatible feature values are
pruned from the list of possible refinements, and the process is
repeated until further refinement is not possible or all negative
evidence has been excluded.
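
The hill-climbing specialization can be sketched on a flattened representation. Here a rule and every instance are plain feature-to-value dictionaries rather than bond-environment graphs, and two refinements count as incompatible when they constrain the same feature; these simplifications and all names are assumptions.

```python
def specialize_rule(rule, positives, negatives):
    """Greedily add feature values shared by all positives to exclude negatives."""
    rule = dict(rule)
    # Refinements that cannot exclude any positive instance.
    candidates = {
        (f, v) for f, v in positives[0].items()
        if f not in rule and all(p.get(f) == v for p in positives)
    }
    covered_neg = [n for n in negatives if all(n.get(f) == v for f, v in rule.items())]
    while covered_neg and candidates:
        def excluded(c):
            return sum(1 for n in covered_neg if n.get(c[0]) != c[1])

        best = max(sorted(candidates), key=excluded)
        if excluded(best) == 0:
            break                                   # no further refinement helps
        rule[best[0]] = best[1]
        candidates = {c for c in candidates if c[0] != best[0]}   # prune incompatible
        covered_neg = [n for n in covered_neg if n.get(best[0]) == best[1]]
    return rule


positives = [{"type": "C", "h": 2, "dbl": 0}, {"type": "C", "h": 1, "dbl": 0}]
negatives = [
    {"type": "N", "h": 2, "dbl": 0},
    {"type": "C", "h": 2, "dbl": 1},
    {"type": "O", "h": 0, "dbl": 0},
]
print(specialize_rule({}, positives, negatives))
```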

Step 3. Generalize rules to include positive evidence. RULEMOD attempts
to generalize the rules in order to include some positive evidence
without including any new negative evidence. This is accomplished
by relaxing the legal values for atom features that were specified by
RULEGEN. RULEMOD examines each atom in the bond environment
of the rule, starting with the atoms most distant from the *
bond. It first checks to see if the whole atom can be removed from
the graph without introducing any negative evidence. If it cannot,
then a hill-climbing search is performed that iteratively removes
the one atom feature that allows the rule to include the largest
amount of new positive evidence without introducing any negative
evidence. When the outermost atoms have been generalized as
much as possible, RULEMOD examines the set of atoms that are
one bond closer to the fragmentation site. This search continues
until all possible changes have been made.

Step 4. Select the final subset of rules. The procedure used in step 1 is
reapplied to select the final set of rules.
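
The hill-climbing selection of step 2 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Meta-DENDRAL's code: instances and rules are modeled as dictionaries from feature names to values, the feature names (atom, neighbors, hydrogens) are hypothetical, and two refinements are treated as incompatible only when they assign values to the same feature.

```python
def covers(rule, instance):
    """A rule covers an instance when every feature it specifies matches."""
    return all(instance.get(f) == v for f, v in rule.items())

def common_feature_values(positives):
    """Feature values shared by all positive instances (candidate refinements)."""
    first, *rest = positives
    return {(f, v) for f, v in first.items()
            if all(inst.get(f) == v for inst in rest)}

def specialize(rule, positives, negatives):
    """Greedily add the feature value that excludes the most covered negatives."""
    rule = dict(rule)
    candidates = {fv for fv in common_feature_values(positives) if fv[0] not in rule}
    covered = [n for n in negatives if covers(rule, n)]
    while candidates and covered:
        def excluded(fv):
            return sum(1 for n in covered if n.get(fv[0]) != fv[1])
        best = max(sorted(candidates), key=excluded)
        if excluded(best) == 0:
            break  # no remaining refinement excludes any negative evidence
        rule[best[0]] = best[1]
        covered = [n for n in covered if covers(rule, n)]
        # Prune refinements incompatible with the one just chosen.
        candidates = {fv for fv in candidates if fv[0] != best[0]}
    return rule, covered

# Hypothetical toy data: two positive and three negative instances.
positives = [{"atom": "N", "neighbors": 2, "hydrogens": 1},
             {"atom": "N", "neighbors": 2, "hydrogens": 2}]
negatives = [{"atom": "C", "neighbors": 2, "hydrogens": 1},
             {"atom": "N", "neighbors": 3, "hydrogens": 1},
             {"atom": "C", "neighbors": 3, "hydrogens": 0}]
refined, still_covered = specialize({}, positives, negatives)
```

Because only values common to every positive instance are considered, each refinement keeps all of the positive evidence, which is the property the text requires of step 2.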

The key assumption made by RULEMOD is that RULEGEN has located rules
that are approximately correct. RULEGEN points out the regions of the rule
space in which detailed searches are needed.

Notice that RULEMOD must frequently invoke the mass-spectrometer
simulator to assess the negative (incorrect) predictions of a proposed rule.
INTSUM provides only positive training instances to RULEGEN. Negative
instances are not provided to RULEGEN directly because there are many
more negative instances than there are positive instances. This is a problem
that frequently arises in systems that are attempting to explain why some
particular set of events took place. Negative information must indicate
everything that did not occur.

All three of Meta-DENDRAL’s subprograms make use of some form of
the mass-spectrometer simulator. These versions of the simulator are flexible
and transparent. They allow the learning element to interpret the training
instances and to reason about the performance of a hypothetical modification
to the cleavage rules. Similar transparent performance elements are used in
systems that learn to perform multiple-step tasks (see Sec. XIV.D5).

Experiment planning and the search of the instance space.
Meta-DENDRAL does not conduct a search of the instance space. Such a search
would require that Meta-DENDRAL select a molecular structure and ask
the chemists to synthesize it and obtain its mass spectrum. To choose an
appropriate molecule, Meta-DENDRAL would need to invert the INTSUM
process. Given a set of possible bond cleavages that it wanted to verify,
Meta-DENDRAL would need to determine a molecule in which those bonds would
cleave. Once the molecule was chosen, existing organic-synthesis programs
could be used to plan the synthesis process (see Article VII.C4, in Vol. II). The
chosen molecule might be difficult or impossible to synthesize. Instance-space
searching was not incorporated into Meta-DENDRAL because of the complex
and time-consuming nature of these procedures.

Another  View  of  the  Meta-DENDRAL  Learning  Algorithm 

In the previous section, we discussed the RULEGEN/RULEMOD pair of
subprograms as a coarse search followed by a fine search. Another view of
this process is that RULEGEN converts a multiple-concept learning problem
into a set of single-concept learning problems. This view regards the output
of RULEGEN not as a set of rules but as a clustering of the training instances.
Once RULEGEN has completed its search, the program knows approximately
which training instances belong together as instances of a single cleavage rule.
At this point, a single-concept learning algorithm could be applied to discover
this rule directly from the RULEGEN-supplied cluster of training instances
rather than by incremental modifications of the RULEGEN-supplied rule.

As part of his thesis work, Mitchell (1978) applied the
candidate-elimination algorithm to this learning problem. Each approximate rule
developed by RULEGEN was used to build a set of positive and negative training
instances that were then processed by the version-space approach. This
technique resulted in a better set of cleavage rules than those developed
with RULEMOD. The version-space approach has the advantage of supporting
incremental learning, so Mitchell’s system can incorporate new training
instances as they become available.

Strengths and Weaknesses of the Meta-DENDRAL System

Meta-DENDRAL  is  an  effective  learning  system  applied  to  a  real-world 
domain.  Meta-DENDRAL  has  discovered  cleavage  rules  for  five  structural 
families of molecules. The system provides solutions to the problem of
interpreting training instances and to the problem of learning in the presence of
certain  kinds  of  noise.  These  solutions  are  based  on  the  incorporation  into 
the  program  of  a  large  amount  of  domain-specific  knowledge.  This  knowledge 
enters  the  system  in  the  form  of  the  half-order  theory  of  mass  spectrometry 
(to  guide  interpretation)  and  in  the  use  of  a  model-directed  search  of  rule 
space. 

The  two-phase  search  of  the  rule  space  provides  an  efficient  method  for 
searching  a  large  space  and  also  suggests  how  a  multiple-concept  learning 
problem  can  be  converted  into  a  set  of  single-concept  learning  problems. 
Among the weaknesses of the system are its domain-specific representation
and the fact that much of the domain knowledge is buried in the code rather
than represented as an explicit knowledge base.

References

Lindsay, Buchanan, Feigenbaum, and Lederberg (1980) present a
comprehensive survey of the many programs developed during the DENDRAL
project. Buchanan and Mitchell (1978) describe Meta-DENDRAL as an AI
learning system. Mitchell (1978) discusses the application of the
candidate-elimination algorithm to Meta-DENDRAL.


D4c.  AM 


AM is a computer program written by Douglas Lenat (1976) that discovers
concepts in elementary mathematics and set theory. Unlike most of the
learning systems described in this chapter, AM does not learn concepts for
use in some performance task. Instead, it seeks simply to define and evaluate
interesting concepts on the basis of a knowledge of mathematical aesthetics.
It employs a refinement-operator approach (see Article XIV.D1) to conduct a
heuristic search of a space of mathematical concepts.

AM starts with a substantial knowledge base of 115 concepts selected from
finite set theory. As AM runs, it collects examples of these concepts, creates
new concepts, and hypothesizes conjectures relating the concepts to each
other. During one typical run of a few CPU hours’ duration, AM defined about
200 new concepts, half of which were quite well known in mathematics. One
of the synthesized concepts was equivalent to the concept of natural numbers.
AM’s knowledge of mathematical aesthetics led it to pursue this concept in
depth, and it spent much time developing elementary number theory,
including conjecturing the fundamental theorem of arithmetic (i.e., every number
has a unique prime factorization). This impressive performance can be traced
to AM’s large body of knowledge about mathematics and its ability to apply
this knowledge to discover new concepts and conjectures.

In this article, we first describe AM’s architecture in terms of its
representation for concepts and its control structure for deciding what tasks to
perform. Then we change our perspective and show how AM can be viewed as
searching an instance space and a concept space by the refinement-operator
method. Third, we examine the initial contents of AM’s knowledge base and
review briefly the concepts that it discovered. Finally, we attempt to
summarize the strengths and weaknesses of AM’s approach to concept discovery.

AM’s  Architecture 

AM is a blend of three powerful methods: frame representation, production
systems, and heuristically guided best-first search. We discuss each of these
in turn.

Frame representation. The concepts that AM discovers and manipulates
are represented as frames (see Article III.C7, in Vol. I), each containing
the same fixed set of slots. Each concept has slots for its definition, for known
positive and negative examples, for links to other concepts that are
specializations and generalizations of the concept, for telling the worth of the concept,
and for several other things. Figure D4c-1 shows the frame representation of
the PRIMES concept after it has been discovered and filled in by AM.

NAME: Prime Numbers

DEFINITIONS:
    ORIGIN: Number-of-divisors-of(x) = 2
    PREDICATE-CALCULUS: Prime(x) ≡ (∀z)(z | x ⇒ z = 1 ⊗ z = x)
    ITERATIVE: (for x > 1): For i from 2 to sqrt(x), ¬(i | x)

EXAMPLES: 2, 3, 5, 7, 11, 13, 17
    BOUNDARY: 2, 3
    BOUNDARY-FAILURES: 0, 1
    FAILURES: 12

GENERALIZATIONS: Nos., Nos. with an even no. of divisors,
    Nos. with a prime no. of divisors

SPECIALIZATIONS: Odd Primes, Prime Pairs, Prime Uniquely-addables

CONJECTURES: Unique factorization, Goldbach’s conjecture,
    Extremes of Number-of-divisors-of

ANALOGIES:
    Maximally divisible numbers are converse extremes of
    Number-of-divisors-of,
    Factor a nonsimple group into simple groups

INTEREST: Conjectures associating Primes with TIMES
    and with Divisors-of

WORTH: 800

Figure  D4c-1.  AM’s  frame  representation  of  the  PRIMES  concept. 

The DEFINITIONS slot is the most important; it provides one or more LISP
predicates that can be applied to determine whether something is an example
of the concept. AM knows a concept when it has a definition for it. However,
the frame representation allows AM to represent more knowledge about a
concept than just its definition. The CONJECTURES, SPECIALIZATIONS, and
GENERALIZATIONS slots, for example, all describe different ways in which
concepts are related to each other. Furthermore, attached to each slot in a
concept are heuristic rules (not shown in the figure) that can be executed to
fill in the contents of a slot or to check the contents to see if they are correct.
These heuristic rules form a production system that carries out the actual
discovery process.
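
To make the frame idea concrete, the following Python sketch models a concept as a record with a fixed set of slots. It is an assumption-laden illustration rather than AM's LISP representation: the class name Concept, the slot names in lowercase, and the helper functions are ours, and only a few of the slots mentioned in the text are included.

```python
from dataclasses import dataclass, field
from math import isqrt
from typing import Callable, Dict, List

@dataclass
class Concept:
    """A frame: every concept carries the same fixed set of slots."""
    name: str
    definitions: Dict[str, Callable[[int], bool]] = field(default_factory=dict)
    examples: List[int] = field(default_factory=list)
    generalizations: List[str] = field(default_factory=list)
    specializations: List[str] = field(default_factory=list)
    conjectures: List[str] = field(default_factory=list)
    worth: int = 0

    def is_example(self, x: int) -> bool:
        """Apply one definition; all definitions are assumed to agree."""
        return next(iter(self.definitions.values()))(x)

def number_of_divisors(x: int) -> int:
    return sum(1 for d in range(1, x + 1) if x % d == 0)

def prime_iterative(x: int) -> bool:
    """No i from 2 to sqrt(x) divides x (for x > 1)."""
    return x > 1 and all(x % i != 0 for i in range(2, isqrt(x) + 1))

primes = Concept(
    name="Primes",
    definitions={"origin": lambda x: number_of_divisors(x) == 2,
                 "iterative": prime_iterative},
    generalizations=["Numbers"],
    specializations=["Odd Primes"],
    worth=800,
)
# Fill in the EXAMPLES slot by testing candidates against a definition.
primes.examples = [n for n in range(18) if primes.is_example(n)]
```

Keeping several equivalent definitions in one slot mirrors the figure, where the same concept is defined both by its origin and by an iterative test.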
Production systems. AM operates as a modified production system.
Each of the 242 heuristic rules attached to the concept slots of AM’s knowledge
base is written, as in all production systems, as a condition part and an
action part. The condition part tells under what conditions the rule should
be executed, and the action part carries out some task such as creating a new
concept or finding examples of an existing concept. For instance, the following
heuristic rule is attached to the EXAMPLES slot of the ANY-CONCEPT frame:

If: The current task is “Fill in examples of X”
and X is a specialization of some concept Y,

Then: Apply the definition of X to each of the examples of Y
and retain those that satisfy the definition.

The main difference between AM’s production-system architecture and
the standard recognize-act cycle is the way rules are selected for execution.
Recall that in an ordinary production system, the condition part of each
rule is compared to the contents of a working memory, and all rules that
match are executed. In contrast, AM is much more selective about which
rules it executes. It operates from an agenda of tasks of the form “Fill in (or
check) slot S of concept C.” Each task has a numeric “interestingness” rating.
AM repeatedly selects the most interesting task from the agenda, gathers all
heuristic rules relevant to performing that task, and executes those rules that
are actually applicable.
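
The agenda-driven control cycle just described can be sketched as a priority queue of tasks. The code below is a minimal illustration under our own assumptions: the class Agenda, the dictionary layout of a rule, and the sample rule propose-check are hypothetical and are not taken from AM.

```python
import heapq
from itertools import count

class Agenda:
    """Tasks of the form (operation, slot, concept), ordered by interestingness."""
    def __init__(self):
        self._heap = []
        self._tie = count()  # tie-breaker so tasks themselves are never compared

    def add(self, interest, operation, slot, concept):
        heapq.heappush(self._heap, (-interest, next(self._tie), (operation, slot, concept)))

    def pop_most_interesting(self):
        neg_interest, _, task = heapq.heappop(self._heap)
        return -neg_interest, task

    def __bool__(self):
        return bool(self._heap)

def run(agenda, rules, max_tasks=10):
    """Select the best task, gather relevant rules, execute the applicable ones."""
    trace = []
    for _ in range(max_tasks):
        if not agenda:
            break
        interest, (operation, slot, concept) = agenda.pop_most_interesting()
        relevant = [r for r in rules if r["operation"] == operation and r["slot"] == slot]
        fired = []
        for rule in relevant:
            if rule["condition"](concept):
                rule["action"](concept, agenda)
                fired.append(rule["name"])
        trace.append((interest, (operation, slot, concept), fired))
    return trace

# Hypothetical rule: after filling in examples, propose checking them.
rules = [{"name": "propose-check", "operation": "fill", "slot": "EXAMPLES",
          "condition": lambda concept: True,
          "action": lambda concept, agenda: agenda.add(200, "check", "EXAMPLES", concept)}]

agenda = Agenda()
agenda.add(300, "fill", "EXAMPLES", "PRIMES")
agenda.add(500, "check", "EXAMPLES", "SETS")
trace = run(agenda, rules)
```

In this run the task on SETS is taken first because it has the highest rating, and the rule that fires on the PRIMES task places a new, less interesting task on the agenda.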

To locate those heuristics that are relevant to the task “Fill in (or check)
slot S of concept C,” AM looks at slot S of concept C to see if it has any
attached heuristics. If it does, those heuristics are executed. If not, AM
examines relatives of concept C to see if any of them have heuristics that can
be inherited by C and applied. For example, when AM is looking for rules
relevant to the task “Fill in examples of sets,” it finds no heuristics attached
to the EXAMPLES slot of SETS. Consequently, it looks at concepts such as
ANY-CONCEPT, which are more general than SETS. The EXAMPLES slot of
ANY-CONCEPT has an attached heuristic that says:

If: The current task is “Fill in examples of X”
and X has a recursive definition,

Then: Instantiate the base step of the recursion to get
a boundary example.

When AM applies this heuristic rule, it creates the null set as a boundary
EXAMPLE of SETS. Heuristics that are closely related to C are executed before
heuristics of distant relatives.
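
The inheritance of heuristics from more general concepts can be pictured as a breadth-first walk up the generalization links, so that heuristics of close relatives come before those of distant ones. The sketch below assumes a simplified, hypothetical hierarchy (SETS under STRUCTURE under ANY-CONCEPT) and represents heuristics by name only.

```python
from collections import deque

def heuristics_nearest_first(concept, slot, generalizations, heuristics):
    """Collect heuristics attached to `slot`, nearest relatives of `concept` first."""
    seen, queue, ordered = {concept}, deque([concept]), []
    while queue:
        current = queue.popleft()
        ordered.extend(heuristics.get((current, slot), []))
        for parent in generalizations.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return ordered

# Hypothetical fragment of the hierarchy and of the attached heuristics.
generalizations = {"SETS": ["STRUCTURE"], "STRUCTURE": ["ANY-CONCEPT"]}
heuristics = {("ANY-CONCEPT", "EXAMPLES"): ["instantiate-base-step"]}

found = heuristics_nearest_first("SETS", "EXAMPLES", generalizations, heuristics)
```

Here SETS itself has nothing attached to its EXAMPLES slot, so the only heuristic found is the one inherited from ANY-CONCEPT.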

A  heuristic  rule  can  do  one  or  more  of  the  following: 

1. Fill in slot S of some concept C. This covers many activities, including
finding new examples for a concept, proposing conjectures, and providing
guidance for the search by modifying the WORTH slot of a concept.

2. Check slot S of concept C. The process of checking a slot involves verifying
that the contents of the slot are correct and noticing interesting facts
about a slot. Often, a rule will check a slot and notice that some new
task should be performed as a result. For example, one rule notices that
all of the examples of one concept, X, are also examples of a more specific
concept, Y. It conjectures that X and Y are equivalent and proposes
the task “Check examples of Y” to see if Y is actually equivalent to an
even more specific concept, Z.

3. Create new concepts. New concepts are created by adding a new frame
to the knowledge base and filling in the DEFINITIONS slot of the frame.
Usually the WORTH slot is filled in as well.

4. Add new tasks to the agenda. Often, a rule will propose that a new task
be added to the agenda. For example, a rule that creates a new concept,
X, will propose the new task “Fill in examples of X.” Most rules that
generate examples of X will propose the task “Check examples of X.”

5. Modify the interestingness of a task on the agenda. The numerical
interestingness of a task is computed from a list of “reasons” for performing
the task. Thus, a rule can add a new reason to an existing task. This
is another way of providing guidance in the search for concepts and
conjectures.
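
The last item, in which a rule strengthens an existing task by adding a reason, can be sketched as follows. The text does not give the function AM uses to combine reasons, so the capped sum below is purely an assumption, as are the names Task and propose and the sample reasons.

```python
class Task:
    def __init__(self, operation, slot, concept):
        self.key = (operation, slot, concept)
        self.reasons = {}  # reason text -> numeric rating

    @property
    def interestingness(self):
        # Assumed combining function (capped sum); AM's formula is not given here.
        return min(1000, sum(self.reasons.values()))

def propose(agenda, operation, slot, concept, reason, rating):
    """Add a task, or add a new reason to the task if it is already on the agenda."""
    key = (operation, slot, concept)
    task = agenda.setdefault(key, Task(operation, slot, concept))
    task.reasons[reason] = max(rating, task.reasons.get(reason, 0))
    return task

agenda = {}
propose(agenda, "fill", "EXAMPLES", "PRIMES", "concept was just created", 100)
propose(agenda, "fill", "EXAMPLES", "PRIMES", "no examples known yet", 150)
propose(agenda, "fill", "EXAMPLES", "PRIMES", "no examples known yet", 150)
task = agenda[("fill", "EXAMPLES", "PRIMES")]
```

Proposing the same task twice for different reasons raises its rating, while repeating an identical reason leaves the rating unchanged.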

Best-first search. The procedure of always choosing the most interesting
task from the agenda gives AM the flavor of best-first search. This search is
well guided by heuristics that modify the INTERESTINGNESS and WORTH slots
of concepts and that propose and justify agenda tasks. AM has 59 heuristics
for assessing the interestingness of concepts and tasks. One rule, for example,
says that a concept is interesting if each of its examples accidentally satisfies
an otherwise rarely satisfied predicate P. (The satisfaction is accidental if the
concept was not deliberately defined as the set of things satisfying P.)

Without  heuristic  guidance  and  the  agenda  mechanism,  AM  would  be 
swamped by a combinatorial explosion of new concepts. However, the fact
that  it  creates  only  200  new  concepts  and  that  half  of  them  are  acceptable  to 
a  mathematician  shows  that  its  search  is  quite  restrained.  AM  is  an  excellent 
example  of  the  power  of  well-informed  best-first  search. 


AM and the Two-space View of Learning

Thus far, we have discussed the architecture of AM. We now turn our
attention to how this architecture is used to accomplish learning. Although
its 242 heuristic rules are extremely varied and can perform many diverse
functions, AM tends to behave as if it were executing the following loop:

Repeat:

Step 1. Select a concept to evaluate and generate examples of it.

Step 2. Check these examples looking for regularities. Based on the
regularities,

(a) update the assessment of the interestingness of the concept,

(b) create new concepts, and

(c) create new conjectures.

Step 3. Propagate the knowledge gained (especially from new conjectures)
to other concepts in the system.

In terms of the two-space view of learning, step 1 searches a space of instances,
step 2 examines these instances and searches the space of concepts (the rule
space) and conjectures, and step 3 performs bookkeeping to maintain the
consistency and integration of the knowledge base. We examine each of these
steps in more detail.

Searching the instance space. When a concept is created, AM knows
very little about that concept aside from its LISP definition. In fact, when
AM is first started up, none of its 115 initial concept frames has any examples
filled in. Thus, one of the first tasks it must perform, in order to assess the
value of the concepts and develop conjectures, is to gather examples (and
negative examples) of its concepts. AM has more than 30 heuristic rules to
guide this example-generating process. Here are some of the techniques they
use:

1. Symbolic instantiation of definitions. Symbolic instantiation converts the
definition of a concept into an example. Typically, each concept has,
as one of its definitions, a recursive LISP predicate. The base step of
this recursion can be instantiated to give an instance that satisfies the
definition. For example, one of the definitions of the SET concept is:

(lambda (s)
  (or (= s {})
      (set.definition (remove (any-member s) s))))

Since the first thing this definition checks is to see if s is the null set,
we can conclude that the null set is an example of a set. Similarly, AM
knows that removing is the opposite of inserting, so it can deduce that
{{}} is also a set by inserting {} into itself.

2. Generate and test. Another approach used by the program is to generate
examples and test them against the concept definition. In order to
generate examples of some concept C, the program looks at “nearby”
concepts in the knowledge base. For example, AM may look at
generalizations of C (concepts more general than C), operations that have C in
their range, cousins of C (concepts that share a common generalization
or specialization with C), and even random LISP atoms from various
internal lists inside AM (such as the list of users of the system).

3. Inheritance of examples. If concept C has other concepts that are more
specialized than it, any example satisfying these more specialized concept
definitions will satisfy C. Examples can thus be inherited “up” the
generalization hierarchy. Similarly, negative examples can be inherited
“down” the generalization hierarchy.

4. Applying the algorithm of the concept. So-called active concepts (i.e.,
operators such as SET-UNION) have algorithms that compute an element in
the range of the concept when given valid arguments from the domain.
Thus, by randomly selecting domain items and applying these
algorithms, AM can produce new examples. For instance, if {A} and {B}
are sets, then SET-UNION.ALGORITHMS produces {A, B}, and the list
({A}, {B}, {A, B}) forms a positive example of SET-UNION.

5. Reasoning by views or by analogy. The VIEWS slot of a concept provides
an algorithm for converting instances of one concept into instances of
another. The ANALOGY slot gives less precise information about how
instances of one concept are related to instances of another concept. AM
can use these two slots to map existing examples into examples of the
concept under construction.

When AM needs to fill in examples of a concept, it attempts to apply these
methods until it has developed 28 examples of the concept (or until it has
exhausted its time or space quota for the current task).
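
Three of the example-generation techniques listed above (generate and test, inheritance of examples, and applying the algorithm of an active concept) can be sketched in Python. The function names and the tiny concept hierarchy are hypothetical, and Python sets stand in for AM's LISP structures.

```python
import random

def generate_and_test(definition, candidates):
    """Keep the candidates (e.g., examples of a generalization) that satisfy the definition."""
    return [x for x in candidates if definition(x)]

def inherit_up(concept, specializations, examples):
    """Examples of every specialization are also examples of the concept."""
    found = list(examples.get(concept, []))
    for special in specializations.get(concept, []):
        for x in inherit_up(special, specializations, examples):
            if x not in found:
                found.append(x)
    return found

def apply_algorithm(algorithm, domain_items, arity, rng, tries=5):
    """Run an active concept on randomly chosen arguments; record (args..., result)."""
    results = []
    for _ in range(tries):
        args = tuple(rng.choice(domain_items) for _ in range(arity))
        results.append(args + (algorithm(*args),))
    return results

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, n))

prime_examples = generate_and_test(is_prime, range(12))

specializations = {"NUMBERS": ["PRIMES"], "PRIMES": ["ODD-PRIMES"]}
examples = {"NUMBERS": [0], "PRIMES": [2], "ODD-PRIMES": [3, 5]}
number_examples = inherit_up("NUMBERS", specializations, examples)

set_union = lambda a, b: a | b
union_examples = apply_algorithm(set_union, [frozenset("A"), frozenset("B")],
                                 arity=2, rng=random.Random(0))
```

Each entry of union_examples is a triple of two argument sets and their union, corresponding to the positive example ({A}, {B}, {A, B}) in the text.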

A particularly interesting feature of AM is its ability to locate the
boundary of a concept. Examples of a concept are classified according to whether
they are:

1. Normal positive examples,

2. Boundary positive examples,

3. Boundary negative examples (i.e., what Winston, 1970, calls near misses),

4. Normal negative examples, or

5. Just plain weird (i.e., have the wrong data structure).

Most examples produced by the above-mentioned techniques will turn out to
be normal positive examples (or normal negative examples, if they do not
satisfy the concept definition). Some of the example-generation techniques,
however, are faulty. They can accidentally generate negative examples. A
particular case is the VIEW slot of SETS that tells AM that it can view a bag
as a set by changing the [ ] brackets (that represent a bag) to { } braces. This
does not always work (e.g., when the bag [a, b, a] is viewed as the set {a, b, a},
which contains an impermissible duplicate element). When AM checks these
examples against the definition of a set, it discovers that they fail. Such
negative examples are classified as boundary negative examples.

Boundary positive examples can be found by such techniques as
instantiating the base case of a recursion (which almost always produces a boundary
case) or by taking boundary non-examples of more specialized concepts and
determining that they satisfy the concept definition. Another technique is to
take a normal positive example and progressively modify it until it fails to
satisfy the definition. This isolates the boundary of the concept quite well.
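
The two ideas in the preceding paragraphs, a faulty view that yields boundary negative examples and progressive modification of a positive example, can be sketched as follows. Lists stand in for both bags and sets, the concept NONEMPTY-SETS is a hypothetical illustration, and none of the function names come from AM.

```python
def is_set(items):
    """Toy definition of SETS: a collection with no duplicate elements."""
    return len(set(items)) == len(items)

def view_bag_as_set(bag):
    """The faulty view: only the brackets change, so duplicates survive."""
    return list(bag)

def classify_views(bags):
    """Viewed bags that fail the definition become boundary negative examples."""
    positives, boundary_negatives = [], []
    for bag in bags:
        candidate = view_bag_as_set(bag)
        (positives if is_set(candidate) else boundary_negatives).append(candidate)
    return positives, boundary_negatives

def find_boundary(example, definition, modify, max_steps=100):
    """Modify a positive example step by step until it stops satisfying the definition."""
    if not definition(example):
        raise ValueError("start from a positive example")
    current = example
    for _ in range(max_steps):
        changed = modify(current)
        if not definition(changed):
            return current, changed  # (boundary positive, boundary negative)
        current = changed
    return None

positives, boundary_negatives = classify_views([["a", "b"], ["a", "b", "a"]])

def nonempty_set(items):
    return len(items) > 0 and is_set(items)

boundary = find_boundary(["a", "b", "c"], nonempty_set, lambda items: items[:-1])
```

Removing one element at a time from a three-element set reaches the one-element set as the last positive example and the empty collection as the first failure, which brackets the boundary of the toy concept.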

By applying all of these techniques, AM is able to gather a good set
of examples that can be used for analysis and generalization. AM can also
assess how much effort was expended to obtain these examples. Thus, it can
conclude that a predicate is “rarely satisfied” or “easily satisfied.” All of these
empirical data are used to drive the search of the rule space and the search
for interesting conjectures.

Searching the rule space. The rule space for AM is the space of
all possible instantiations of its concept frame. This is indeed an immense
space. To search it, AM applies a refinement-operator method similar to the
techniques employed by BACON and ID3 (see Article XIV.D3b). The current
set of concept frames can be thought of as AM’s current set of hypotheses.
These hypotheses are repeatedly refined and extended by applying operators
(i.e., heuristics) that create new concepts and conjectures.

AM  has  roughly  40  heuristics  that  create  new  concepts.  These  can  be 
broken  into  two  sets.  One  set  of  heuristics  is  general  and  can  be  applied  to 
virtually  any  concept  in  AM.  The  second  set  is  applicable  only  to  functions 
and  relations — active  concepts  that  can  be  viewed  as  mapping  elements  from 
some  domain  set  into  some  range  set.  The  general  methods  are: 

1. Generalization. AM implements, in some form, virtually all rules of
generalization that have appeared in other AI programs. The
dropping-condition, adding-option, and turning-constants-to-variables rules are
all used. Also implemented is the technique of specializing a negative
conjunct (e.g., A ∧ ¬B is generalized to A ∧ ¬B', where B' is more
specific than B). AM can generalize expressions involving quantification,
for example, converting ∃x ∈ S : P(x) to ∃x ∈ S' : P(x), where S'
is a larger set than S. Since the definitions of concepts are typically
recursive LISP functions, AM contains many rules of generalization that
are applicable to recursion. For instance, a definition can be generalized
by eliminating one of a conjoined pair of recursive calls or by disjoining
a new recursive call. In particular, AM knows that if one recursive call
involves CAR (or CDR), the other recursive call should use CDR (or CAR,
respectively).

2. Specialization. AM also implements a wide variety of rules of
specialization. These are the reversals of the rules of generalization mentioned
above.

3. Handling exceptions. When a concept has a lot of exceptions (negative
boundary  examples),  a  new  concept  can  be  created  whose  instances 
are  these  negative  examples.  Also,  AM  can  create  the  concept  whose 
instances  are  those  positive  examples,  but  not  boundary  examples,  of 
the  original  concept.  This  allows  AM  to  represent  the  conjecture  that 
all  prime  numbers  are  odd — except  the  number  2. 

4. Reasoning by analogy. If J is a conjecture and J' is an analogous
conjecture, then AM can create the concept {b' | J'(b')} and also the concept
{b' | ¬J'(b')}, that is, the set of objects for which J' is true and the set
of objects for which J' is false.
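
The dropping-condition and adding-option rules of generalization, and the reversal of the former as a rule of specialization, can be sketched on definitions that are conjunctions of predicates. This representation is our assumption for illustration; AM's definitions are LISP functions, and the helper names below are hypothetical.

```python
from math import isqrt

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, isqrt(n) + 1))

def is_odd(n):
    return n % 2 == 1

def conjunction(conditions):
    """A definition that holds when every condition holds."""
    return lambda x: all(c(x) for c in conditions)

def drop_condition(conditions, index):
    """Generalize by removing one conjunct."""
    return conditions[:index] + conditions[index + 1:]

def add_condition(conditions, extra):
    """Specialize by adding a conjunct (the reversal of dropping a condition)."""
    return conditions + [extra]

def add_option(definition, option):
    """Generalize by disjoining an alternative."""
    return lambda x: definition(x) or option(x)

odd_prime_conditions = [is_prime, is_odd]
odd_primes = conjunction(odd_prime_conditions)
primes = conjunction(drop_condition(odd_prime_conditions, 1))
primes_or_one = add_option(primes, lambda n: n == 1)

def examples(definition):
    return [n for n in range(12) if definition(n)]
```

Each generalization covers every example of the concept it came from, and comparing the examples of primes with those of odd_primes isolates 2 as the single exception mentioned under handling exceptions.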

AM’s  concept-creation  methods  that  apply  to  active  concepts  (mappings) 
usually  produce  new  active  concepts.  New  concepts  can  be  created  by  the 
following: 

1. Generalization. The domain and range of an existing concept can be
expanded. 

2.  Specialization.  The  domain  and  range  of  an  existing  concept  can  be 
contracted  (restricted). 

3.  Inversion.  The  inverse  of  an  existing  relation  can  be  created.  AM  can  also 
create  interesting  concepts  such  as  the  inverse  image  of  an  interesting 
subset  of  the  range  and  the  inverse  image  of  an  interesting  value  in  the 
range. 

4. Composition. Two functions F(x) and G(y) can be composed to obtain
the new functions F(G(y)) and G(F(x)).

5.  Projection.  An  existing  multiple-argument  function  F  can  be  projected 
onto  a  subset  of  its  arguments.  For  example,  Proj2(F(x,  y))  is  just  y. 

6. Coalesce. The arguments of F(x, y) can be coalesced to produce a new
function, G(x) = F(x, x).

7. Canonization. This method takes two predicates, P1 and P2, and
defines a function, F, and a set, the range of F, such that P1(x, y) ⇔
P2(F(x), F(y)). If x and y are instances of concept C, then F maps C to
the set of canonical C. Thus, P2 applied to canonical C is the same as
P1 applied to C. AM uses this operation to invent NUMBERS by taking
SAME-SIZE(x, y) as P1 and EQUAL(x, y) as P2, and applying them to
bags to create the canonizing function SIZE-OF(x) and the concept of
CANONICAL-BAGS (i.e., bags that contain only T). CANONICAL-BAGS
can be interpreted as numbers.

8. Parallel-replace and parallel-join. These concept-creation operators come
in many varieties and are used to create new concepts by repeated
application of old concepts. Multiplication, for example, can be created
by repeated addition (with the parallel-replace method).

9. Permutation. The arguments of a function or relation can be permuted
to  give  a  new  function  or  relation. 

10.  Cartesian  product.  A  new  concept  can  be  obtained  by  taking  the  Cartesian 
product  of  existing  concepts. 
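
Several of the operators in this list are easy to express as higher-order functions. The Python sketch below is illustrative only: the function names are ours, inverse_image works over an explicitly supplied finite domain, and repeat is a simplified stand-in for creating a concept by repeated application of an old one, not a rendering of AM's parallel-replace operator.

```python
def compose(f, g):
    """Composition: x -> f(g(x))."""
    return lambda x: f(g(x))

def coalesce(f):
    """Coalesce: G(x) = F(x, x)."""
    return lambda x: f(x, x)

def permute(f):
    """Permutation of the arguments of a two-argument function."""
    return lambda x, y: f(y, x)

def inverse_image(f, domain, value):
    """Inversion: the domain elements that f maps to an interesting value."""
    return [x for x in domain if f(x) == value]

def repeat(op, identity):
    """Build a new operation by applying `op` to x repeatedly, n times."""
    def repeated(x, n):
        result = identity
        for _ in range(n):
            result = op(result, x)
        return result
    return repeated

def add(a, b):
    return a + b

def subtract(a, b):
    return a - b

double = coalesce(add)        # x + x
times = repeat(add, 0)        # multiplication as repeated addition
square = coalesce(times)      # x * x
double_of_square = compose(double, square)
```

Applying the same small set of operators to their own results (for example, coalescing the multiplication obtained by repetition) shows how new active concepts can be derived from old ones.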

Many of the refinement operators in this group (e.g., COALESCE,
COMPOSITION) are also concepts defined in AM. It is perhaps only in mathematics that
the means of study are also the objects of study.
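
The canonization operator described in the list above can be checked on a small example. In this sketch, which rests on our own modeling assumptions, bags are Python lists, the canonizing function replaces every element by the symbol T, and the relation P1(x, y) ⇔ P2(F(x), F(y)) is verified over a handful of sample bags.

```python
def same_size(x, y):
    """P1: two bags have the same number of elements."""
    return len(x) == len(y)

def equal(x, y):
    """P2: ordinary equality."""
    return x == y

def size_of(bag):
    """F: map a bag to its canonical form, a bag that contains only T."""
    return ["T"] * len(bag)

sample_bags = [[], ["a"], ["b", "b"], ["x", "y"], ["a", "b", "c"]]

canonization_holds = all(
    same_size(x, y) == equal(size_of(x), size_of(y))
    for x in sample_bags for y in sample_bags
)
canonical_bags = [size_of(bag) for bag in sample_bags]
```

A canonical bag such as [T, T] behaves like the number 2, since two bags have the same size exactly when their canonical forms are equal.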

Representing and proposing conjectures. Roughly 30 of AM’s rules
also propose conjectures based upon examination of the empirical data.
Conjectures take one of the following forms:

1. C1 is an example of C2;

2. C1 is a specialization (generalization) of C2;

3. C1 is equivalent to C2;

4. C1 is related by X to C2 (where X is some predicate);

5. Operation C1 has domain D or range R.

Most of these conjectures are discovered by performing rough statistical
comparisons of examples. If all of the examples of C1 are also examples of
C2, then AM conjectures that C1 is a specialization of C2. If AM is unable
to find negative examples of C1, it conjectures that C1 is trivially true. If
all examples of elements in the range of C1 seem to be numbers, then AM
conjectures that C1 has numbers as its range. If all of the range elements of
C1 are equal to corresponding domain elements, then perhaps C1 is the same
as the identity function.
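
The comparison of example sets that leads to specialization and equivalence conjectures can be sketched as follows. The sketch uses exact subset tests where the text speaks of rough statistical comparisons, and the function name, the relation labels, and the concept names are hypothetical.

```python
def propose_conjectures(examples):
    """Compare the known examples of each pair of concepts.

    `examples` maps a concept name to the set of its known examples.
    """
    conjectures = []
    names = sorted(examples)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ea, eb = examples[a], examples[b]
            if not ea or not eb:
                continue  # nothing can be concluded without examples
            if ea == eb:
                conjectures.append((a, "equivalent-to", b))
            elif ea < eb:
                conjectures.append((a, "specialization-of", b))
            elif eb < ea:
                conjectures.append((b, "specialization-of", a))
    return conjectures

known = {
    "NUMBERS": set(range(10)),
    "PRIMES": {2, 3, 5, 7},
    "NUMBERS-WITH-2-DIVISORS": {2, 3, 5, 7},
}
conjectures = propose_conjectures(known)
```

Because such conjectures rest only on the examples gathered so far, a conjecture of equivalence may be wrong when the example sets are small, which is worth keeping in mind given that AM believes its conjectures once they are proposed.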

Conjectures, once proposed, are believed completely by AM. The relevant
slots are changed, and the changes are propagated throughout the knowledge
base. If two concepts are conjectured to be equivalent, they are merged and
the space occupied by one is released. AM can also modify the LISP definitions
to take advantage of new conjectures.

Propagating acquired knowledge. Several heuristics (including those
that locate and generate examples) serve to propagate new information throughout
the network of frames that constitutes AM’s knowledge base. These are
fairly straightforward and make heavy use of the three sets of inheritance
links (IS-AN-EXAMPLE-OF/EXAMPLES, SPECIALIZATIONS/GENERALIZATIONS,
DOMAIN/RANGE).

To complete our review of AM from the perspective of the two-space
view of learning, we note that, although the example-generation techniques
discussed above perform sophisticated instance selection, there is no
corresponding need for complex interpretation routines like those found in
Meta-DENDRAL. On the contrary, since mathematical objects are easily
represented and manipulated in LISP, there is no need to convert them to some
alternate representation. More sophisticated instance selection and
interpretation routines would probably be needed for nonmathematical domains.

AM’s Initial Knowledge Base

We now turn our attention to AM’s actual performance. First we describe
the knowledge that it started with, and then we give a summary of the
concepts and conjectures it found.

AM’s initial knowledge base contains the basic concept hierarchy shown
in Figure D4c-2. In addition, beneath the concept of STRUCTURE are many
important data structures: SETS, ORDERED SETS, BAGS, LISTS (i.e., ordered
BAGS), and ORDERED PAIRS. Under the ACTIVITY concept are many operations
such as SET-INTERSECT, SET-UNION, SET-DIFFERENCE, and SET-DELETION
(and analogous operations for BAGS, ORDERED SETS, and LISTS).
Also, several of the concept-creation operators such as PARALLEL-JOIN,
RESTRICT, PROJECTION, and so forth, are included here. Under PREDICATES
are the constant predicates TRUE and FALSE, as well as the concept of EQUALITY.
Finally, the most important part of the initial knowledge base is the body
of 242 heuristic rules attached to various concepts in this tree. Most of these
were summarized above.

Results: AM as a Mathematician

Now we review the mathematics that AM explored. Throughout, AM
acted alone, with a human user watching it and occasionally renaming some
concepts for his (or her) own benefit. Like a contemporary historian
summarizing the work of the Babylonian mathematicians, we will use present-day
terms to describe AM's concepts, and we will criticize its behavior in light of
our current knowledge of mathematics.

AM began its investigations with scanty knowledge of a few set-theoretic
concepts. Most of the obvious set-theoretical relations (e.g., de Morgan's
laws) were eventually uncovered; since AM never fully understood abstract
algebra, the statement and verification of each of these was quite obscure. AM
never derived a formal notion of infinity, but it naively established conjectures
like “A set can never be a member of itself” and procedures for making
chains of new sets (“Insert a set into itself”). No sophisticated set theory
(e.g., diagonalization) was ever done.

After this initial period of exploration, AM decided that “equality” was
worth generalizing and thereby discovered the relation “same size as.” Natural
numbers were based on this discovery, and, soon after, most simple arithmetic
operations were defined.

Since addition arose as an analogue to union, and multiplication as a
repeated substitution, it came as quite a surprise when AM noticed that they
were related (namely, $N + N = 2 \times N$). AM later rediscovered multiplication
in three other ways: as repeated addition, as the numeric analogue of the
Cartesian product of sets, and using the cardinality of the power set of the
union of two sets.

Raising  to  fourth-powers  and  taking  fourth-roots  were  discovered  at  this 
time.  Perfect  squares  and  perfect  fourth-powers  were  isolated.  Many  other 
numeric  operations  and  kinds  of  numbers  were  found  to  be  of  interest:  odds, 
evens,  doubling,  halving,  integer  square  root,  and  so  on.  Although  it  isolated 
the  set  of  numbers  that  had  no  square  roots,  AM  was  never  close  to  discovering 
rationals, let alone irrationals. No notion of “closure” was provided to, or
discovered by, AM.

The associativity and commutativity of multiplication indicated to AM
that it could accept a bag of numbers as its argument. When AM defined
the inverse operation corresponding to “times,” this property allowed the
definition to be: “any bag of numbers greater than 1 whose product is z.” This
was just the notion of factoring a number z. Minimally factorable numbers
turned out to be what we call primes. (Maximally factorable numbers were
also thought to be interesting.)
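The definition of the inverse of “times” quoted above can be tried out directly. The following is a minimal sketch, not AM's LISP definition: the function name `factor_bags` and the use of sorted tuples to stand for bags are assumptions made for illustration.

```python
def factor_bags(n, smallest=2):
    """Return every bag (as a non-decreasing tuple) of integers greater
    than 1 whose product is n, using only factors >= `smallest`."""
    bags = [(n,)] if n >= smallest else []
    d = smallest
    while d * d <= n:
        if n % d == 0:
            # d is the smallest member of the bag; factor the rest with
            # members no smaller than d so each bag is listed once.
            for rest in factor_bags(n // d, d):
                bags.append((d,) + rest)
        d += 1
    return bags


# "Minimally factorable" numbers: only one bag, the number itself.
primes = [n for n in range(2, 21) if len(factor_bags(n)) == 1]
print(sorted(factor_bags(12)))
print(primes)
```

Ranking numbers by how many bags they have gives both extremes mentioned in the text: the numbers with exactly one bag are the primes, and the numbers with unusually many bags are the maximally factorable ones.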

Prime pairs were discovered in a bizarre way: by restricting the domain
and range of addition to primes (i.e., solutions of $p + q = r$ in primes).

AM conjectured the fundamental theorem of arithmetic (unique factorization
into primes) and Goldbach’s conjecture (every even number greater than
2 is the sum of two primes) in a surprisingly symmetric way. The unary
representation of numbers gave way to a representation as a bag of primes
(based on unique factorization), but AM never came up with exponential
notation. Since the key concepts of remainder, greater than, greatest common
divisor, and exponentiation were never mastered, progress in number
theory was arrested.

When a new base of geometric concepts was added, AM began finding
some more general associations. In place of the strict definitions for the
equality of lines, angles, and triangles came new definitions of concepts
comparable to parallel, equal measure, similar, congruent, translation, and
rotation, together with many that have no common name (e.g., the relationship
of two triangles sharing a common angle). A clever geometric interpretation
of Goldbach's conjecture was found: Given all angles of a prime number
of degrees (0°, 1°, 2°, 3°, 5°, 7°, 11°, ..., 179°), any angle between 0 and
180 degrees can be approximated (to within 1°) as the sum of two of those
angles. Lacking a geometry “model” (an analogical representation like the
one Gelernter, 1963, employed; see Article II.D3, in Vol. I), AM was doomed to
propose many implausible geometric conjectures (see Article III.C5, in Vol. I).
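The geometric reading of Goldbach's conjecture can be checked numerically for whole-degree angles. This sketch assumes that the angle list is 0°, 1°, and every prime up to 179°, and it tests only integer targets; the helper names are hypothetical.

```python
def primes_up_to(limit):
    """Sieve of Eratosthenes: all primes <= limit."""
    sieve = [True] * (limit + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            for multiple in range(p * p, limit + 1, p):
                sieve[multiple] = False
    return [n for n, is_prime in enumerate(sieve) if is_prime]


# Assumed angle set: 0 and 1 degrees plus the primes up to 179 degrees.
angles = [0, 1] + primes_up_to(179)
pair_sums = {a + b for a in angles for b in angles}


def approximable(target, tolerance=1):
    """True if some sum of two listed angles is within `tolerance` degrees."""
    return any(abs(target - s) <= tolerance for s in pair_sums)


print(all(approximable(t) for t in range(0, 181)))
```

The tolerance matters: 27° is not exactly the sum of two listed angles, but 26° = 13° + 13° is, so 27° is still covered to within 1°.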

Perhaps a full appreciation for the depth of AM’s search of the concept
space can be gained by examining Figure D4c-3, which shows the derivation
path for prime numbers. It is eight levels deep and requires M concept-creation
operations. This derivation is quite impressive, both because of its
depth, and because the final concept is so far removed semantically from
the initial concepts. Note, in particular, the fascinating way in which a new
concept, SELF-COMPOSE, is used as a new operator to derive TIMES21 and
TIMES22. AM is able to search in a highly directed, rational fashion.

Evaluating  AM 

It is important to ask how general the AM program is: Is the knowledge
base “just right” (i.e., finely tuned to elicit this one chain of behaviors)?
The answer is no: The whole point of this project was to show that a relatively
small set of general heuristics can guide a nontrivial discovery process.
Keeping the program general and not finely tuned was a key objective. Each
activity or task was proposed by some heuristic rule (like “Look for extreme
cases of X”) that was used time and time again, in many situations. It was
not considered fair to insert heuristics that provide guidance in only a single
situation. For example, the same heuristics that lead AM to decompose numbers
(using TIMES-inverse) and thereby discover unique factorization, also lead
to decomposing numbers (using ADD-inverse) and the discovery of Goldbach’s
conjecture.

AM  does,  however,  have  some  weaknesses.  Although  AM  was  able  to 
discover  and  refine  many  interesting  new  concepts,  it  had  no  way  of  improving 
its  stock  of  heuristic  rules.  Consequently,  as  AM  ran  longer  and  longer,  the 
concepts  it  defined  were  further  and  further  from  the  primitives  it  began 
with, and the efficacy of its fixed set of heuristics gradually declined. Lenat
(1980) has proposed a solution to this problem. He advocates turning each
heuristic rule into a concept and developing additional operators for creating
new heuristics. The EURISKO project is presently pursuing this research.

A deeper problem has to do with some of the characteristics of the domain
of mathematics that may not hold in other domains. One important fact
about elementary mathematics is that the density of interesting concepts
is quite high. AM relies on the ability to build up complex concepts from
more primitive concepts in a step-by-step fashion. At each step, the partial
concepts must appear to AM to be interesting. In many domains, however,
it is not possible to assess the interestingness of partial solutions. Consider,
for example, the problem of credit assignment in a game such as chess. For a
novice chess player, it is necessary to play an entire game before receiving any
feedback on the quality of individual moves. Even as a player becomes expert,
it is still necessary to search several moves in advance in order to evaluate a
particular choice. Future efforts to develop AM-style discovery systems in
other domains may face difficulties in evaluating the worth of concepts. More
sophisticated interestingness heuristics may need to be developed. Work on
the EURISKO project may provide some answers to these questions.


Conclusion 

AM  is  a  powerful  discovery  system  that  investigates  and  refines  concepts 
in  elementary  set  and  number  theory.  It  begins  with  a  large  body  of  knowledge 
about  what  kinds  of  concepts  are  mathematically  interesting  and  how  they 
can be synthesized from existing concepts. This knowledge can then carry
AM  far  beyond  its  initial  store  of  concepts  to  discover  prime  numbers  and  the 
fundamental  theorem  of  arithmetic. 

References

Lenat (1976) provides complete details on AM; see also Lenat (1977).
Lenat (1980) describes the EURISKO project.


D5. Learning to Perform Multiple-step Tasks


Most of the learning programs discussed so far in this chapter were designed
to learn how to perform single-step tasks: tasks in which one rule, or a
set of independent rules, can be applied in one step to accomplish the
performance task. In pattern classification (Article XIV.D2) and single-concept
learning (Section XIV.D3), the performance element takes an unknown object or
pattern and assigns it to one of two classes (e.g., an arch or a “nonarch”). These
systems apply a single classification rule, or concept, to perform the classification.
Even the sequence-extrapolation problems addressed by BACON (Article
XIV.D3b) and SPARC (Article XIV.D3d) involve applying a single rule to predict
the next item in the sequence from the previous items. Similarly, in the
multiple-rule tasks of soybean-disease diagnosis (Article XIV.D4a) and
mass-spectrometry simulation (Article XIV.D4b), several rules are applied in parallel
to determine the unknown disease or to predict how the unknown molecule
will break apart.

Multiple-step Tasks

In contrast, this section surveys a few learning systems that learn how
to perform multiple-step tasks: tasks in which several rules must be
chained together into a sequence. Examples of multiple-step tasks include
the game of checkers, in which rules for making individual moves must be
chained together to play a whole game, and symbolic integration, in which
several rules of integration must be applied sequentially to solve each integral.
The goal of the learning system is to acquire a good set of rules for performing
these tasks.

Multiple-step tasks are essentially planning tasks in which the performance
element must find a sequence of operators to get from some starting
state (e.g., the opening position in checkers) to some goal state (e.g., a won
game). The chapters on search (Chap. II, in Vol. I) and planning (Chap. XV)
describe various methods that have been used to accomplish this state-space
search (see Article II.C3, in Vol. I). So far, AI learning systems have been
developed only for simple, forward-chaining planning programs. No attempts have
been made to learn how to perform hierarchical or constraint-based planning.

Viewing the Performance Element as a Production System

The first four systems described in this section are Samuel’s (1959) checkers
player, Waterman’s (1970) poker player, Sussman’s (1975) HACKER planning
system, and Mitchell’s LEX system for symbolic integration (Mitchell, Utgoff,
and Banerji, in press). All are simple, forward-chaining problem solvers and,
thus, can be viewed as simple production systems. The grammatical-inference
systems discussed in the fifth article (Article XIV.D5e) employ context-free
grammars, which can also be considered production systems. The knowledge
base for each of these systems contains a set of production rules of the form:

(situation_1) => (action_1)

(situation_2) => (action_2)

...

(situation_n) => (action_n).

The performance element repeatedly selects a rule whose situation part
(left-hand side) matches the current state and applies the rule by performing the
action indicated (right-hand side). The action usually has the effect of moving
the performance element to a new state, closer to the goal.
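The select-and-apply cycle just described can be sketched as a small interpreter. This is a generic illustration rather than the control structure of any of the systems surveyed: the first-match conflict resolution, the `max_steps` guard, and the toy counting rules are all assumptions.

```python
def run_production_system(state, rules, is_goal, max_steps=100):
    """Repeatedly apply the first rule whose situation matches the state.

    `rules` is an ordered list of (name, situation, action) triples, where
    `situation` is a predicate on states and `action` maps a state to the
    next state. Returns the final state and the list of rule names applied.
    """
    trace = []
    for _ in range(max_steps):
        if is_goal(state):
            break
        for name, situation, action in rules:
            if situation(state):
                state = action(state)
                trace.append(name)
                break
        else:
            break  # no rule matches the current state
    return state, trace


# Toy task (hypothetical): drive a counter to zero.
RULES = [
    ("halve", lambda n: n > 0 and n % 2 == 0, lambda n: n // 2),
    ("decrement", lambda n: n > 0, lambda n: n - 1),
]
final_state, trace = run_production_system(10, RULES, lambda n: n == 0)
print(final_state, trace)
```

The returned trace records only the rules that fired. A learning element that also needs the rules that were considered but not chosen would have to log those as well.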

For most of the programs discussed in this section, the possible actions
are provided in advance. The problem addressed by the learning element is to
determine under what situations the actions should be applied. This learning
problem is similar in many ways to the problems addressed in Section XIV.D4
on learning multiple concepts.

However, two factors make this learning problem more difficult. First,
because the rules must be chained together, the learning element has to
consider possible interactions among the rules when it modifies the knowledge
base. In LEX, for example, the learning element might decide that in any
integral of the form

$$\int c\,f(x)\,dx ,$$

the constant $c$ should always be factored out. This is expressed in LEX as the
production rule

If the integral has the form $\int c\,f(x)\,dx$, then apply OP03,

where OP03 converts $\int c\,f(x)\,dx$ to $c \int f(x)\,dx$. Unfortunately, if the constant
$c$ is 0 or 1, this is not an advisable step. Instead, OP08 (convert $1 \cdot f(x)$ to $f(x)$)
or OP15 (convert $0 \cdot f(x)$ to 0) should be applied. When LEX is learning the
production rule for OP03, it must take into account these possible interactions
with OP08 and OP15. In fact, LEX's goal is to discover the best operator to
apply in every situation. Thus, any time more than one operator is applicable
because of overlapping left-hand sides, LEX must eliminate the overlap. In
this case, the appropriate rule for OP03 is:

If the integral has the form $\int c\,f(x)\,dx$ with $c \neq 0$ and $c \neq 1$, then apply OP03.
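The overlap between left-hand sides can be made concrete with a small sketch. It is not LEX's representation: here each rule's situation is reduced to a hypothetical predicate on the constant alone, which is enough to show the unrefined OP03 rule colliding with OP08 and OP15.

```python
# Situation parts reduced to a test on the constant c (an assumption made
# for illustration; LEX matched patterns over whole expressions).
NAIVE_RULES = {
    "OP03": lambda c: True,    # factor out any constant
    "OP08": lambda c: c == 1,  # 1 * f(x) -> f(x)
    "OP15": lambda c: c == 0,  # 0 * f(x) -> 0
}

# Refined rule base: OP03 no longer matches when c is 0 or 1.
REFINED_RULES = dict(NAIVE_RULES, OP03=lambda c: c != 0 and c != 1)


def applicable(rules, c):
    """Names of all operators whose situation part matches constant c."""
    return sorted(name for name, matches in rules.items() if matches(c))


for c in (0, 1, 2):
    print(c, applicable(NAIVE_RULES, c), applicable(REFINED_RULES, c))
```

With the refined rule base, exactly one operator is applicable for every value of the constant, which is the condition the overlap elimination is meant to establish.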

Eliminating such overlap is a particular instance of the general problem of
incorporating new knowledge into the knowledge base (see Article XIV.A).

The second difficult aspect of multiple-step tasks is the problem of credit
assignment. In single-step tasks, the system has available a performance
standard that can be employed immediately after a rule is applied to determine
whether or not the rule is correct. In disease diagnosis, for example,
the learning element receives the correct disease classification along with each
training instance. The performance element can apply its diagnosis rules and
receive immediate feedback on the correctness of those rules. The performance
standard can even be incorporated directly into the learning process
as in the version-space method, in which the correct classification determines
how the version space is updated.

In  multiple-step  tasks,  however,  feedback  from  the  performance  standard 
is  not  usually  available  until  the  game  is  completed  or  the  problem  is  solved. 
The  program  can  determine  only  whether  the  entire  sequence  of  rules  was 
good  or  bad.  The  credit-assignment  problem  is  the  problem  of  converting  this 
overall  performance  standard  into  a  performance  standard  for  each  rule.  The 
overall credit or blame must be parceled out somehow among the individual
rules  that  were  applied. 

The  Importance  of  a  Transparent  Performance  Element 

To solve these problems of integration and credit assignment, it is critically
important for the performance element to be transparent. A transparent
performance element can provide the learning element with a trace of all
actions that it considered, as well as those it actually performed. This allows
the learning element to determine all of the rules that might have been
applicable at each step of the problem-solving process. Such information makes it
easier to solve the problem of integrating new rules into the knowledge base.

A complete performance trace also aids the credit-assignment task. During
credit assignment, it is very useful to know why the performance element
chose the rules that it did and what it expected those rules to do. By comparing
the goals and expectations of the performance element with what really
transpired, credit and blame can be assigned to individual decisions.

Extracting Local Training Instances from the Performance Trace

When the learning system for a multiple-step task is presented with a
training instance (such as a board position in checkers and knowledge of
which side can win from that position), it cannot immediately learn from the
training instance. Instead, it must actually perform the task (that is, play
out the checkers game) and compare the result with the information supplied
by the performance standard (that is, which side should have won). During
credit assignment, it can actually decide which individual decisions were good
and which bad, and these evaluated decisions can serve as training instances
for learning the left-hand sides of the production rules in the knowledge base.
By performing the task and assigning credit and blame, the “global” training
instances can be converted into “local” training instances.

For example, in LEX, a global training instance consists of an integral
such as

$$\int 2x^2\,dx$$

along with knowledge of whether or not the integral can be solved. The
solution trace (see Fig. D5-1) shows that OP12 should not have been applied,
since it leads to a complicated expression that requires several more steps to
solve, but that OP03 and OP02 were used correctly.

Thus,  three  local  training  instances  can  be  extracted: 

$\int 2x^2\,dx \Rightarrow$ OP12 (negative).

$\int 2x^2\,dx \Rightarrow$ OP03 (positive).

$2\int x^2\,dx \Rightarrow$ OP02 (positive).

Once local training instances have been extracted, the techniques for
doing concept learning discussed in Sections XIV.D3 and XIV.D4 can be applied
to learn the left-hand sides of the production rules in the knowledge base.
Figure D5-2 shows a slight perturbation of the simple learning-system model
presented in Article XIV.A. The model now contains a loop in which the
performance trace is analyzed by the learning element to extract local training
instances. Global training instances are still supplied by the environment.
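The conversion from a performance trace to local training instances can be sketched as follows. The labeling rule used here (a step along the solution path is positive, a step that leaves the path is negative) is a simplifying assumption, and the state strings are illustrative labels only, not LEX's internal expressions.

```python
def extract_local_instances(explored_steps, solution_path):
    """Label each explored (state, operator, result) step.

    A step is positive if it moves along the solution path, and negative
    if it starts on the path but leads off it. Steps taken from states
    that are not on the path are ignored in this sketch.
    """
    on_path = set(zip(solution_path, solution_path[1:]))
    path_states = set(solution_path)
    instances = []
    for state, operator, result in explored_steps:
        if (state, result) in on_path:
            instances.append((state, operator, "positive"))
        elif state in path_states:
            instances.append((state, operator, "negative"))
    return instances


# Hypothetical trace: three operator applications tried by the problem solver.
steps = [
    ("int 2x^2 dx", "OP12", "<complicated expression>"),
    ("int 2x^2 dx", "OP03", "2 int x^2 dx"),
    ("2 int x^2 dx", "OP02", "<solved>"),
]
path = ["int 2x^2 dx", "2 int x^2 dx", "<solved>"]
local_instances = extract_local_instances(steps, path)
for instance in local_instances:
    print(instance)
```

Each resulting triple pairs a state with an operator and a label, which is the form of input the concept-learning methods need in order to refine the left-hand side of that operator's rule.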


Figure D5-1. A sample performance trace.

Figure D5-2. A modified model of learning systems.


Outline  of  This  Section 

The  five  systems  presented  in  this  section  all  perform  multiple-step  tasks 
and, consequently, must address problems of integrating new rules and assigning
credit and blame. Waterman, and to some extent Samuel, simplifies
the  credit-assignment  problem  by  obtaining  a  move-by-move  performance 
standard  from  the  environment.  Furthermore,  all  of  the  systems,  except 
Waterman's  poker  system,  ignore  the  problem  of  integrating  new  rules  into  the 
knowledge  base.  Work  in  this  area  is  still  in  its  infancy,  and  more  sophisticated 
learning  systems  for  multiple-step  tasks  can  be  expected  in  the  future. 

References 

Buchanan, Mitchell, Smith, and Johnson (1977) provide another perspective
on the use of feedback in learning systems.


D5a.  Samuel’s  Checkers  Player 


From 1947 to 1967, Arthur Samuel conducted a continuing research project
aimed at developing a checkers-playing program that was able to learn from
experience. Samuel investigated three different representations for checkers
knowledge (memorized moves, polynomial evaluation functions, and signature
tables) and two different training methods (self-play and book-move
learning). The work on rote learning of checkers moves is discussed in Article
XIV.B2. The present article discusses two specific learning situations: (a) self-play
as it was used to learn a polynomial evaluation function and (b) book-move
training as it was used to learn a set of signature tables. Samuel
experimented with several other combinations of training methods and
representations (for more details, see Samuel, 1959, 1967).

The performance element in all of Samuel’s systems employs a look-ahead,
game-tree search to determine which moves to make (see Articles II.B3 and
II.C5, in Vol. I). The performance element uses a static evaluation function
(Article II.C5) to evaluate possible future positions in the game and applies
alpha-beta minimaxing to determine the best move to make. The goal of the
learning process is to establish and improve this static evaluation function
through experience.

Learning  a  Polynomial  Evaluation  Function  Through  Self-play 

The first static evaluation function investigated by Samuel was a polynomial
of the form

$$\text{value} = \sum_i w_i f_i ,$$

where $f_i$ are board features and $w_i$ are real-valued weights (coefficients). For
most of Samuel’s experiments, a polynomial with 16 features was employed.
Each board feature provides a numerical measure of some aspect of the board
position under evaluation. For example, the EXCH feature measures the
relative exchange advantage of the player whose turn it is to move. EXCH
is computed by taking $T_{\text{current}}$, the total number of squares into which the
player to move may advance a piece, and in so doing force an exchange, and
subtracting the corresponding quantity for the previous move by the
opposing player.
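A linear evaluation function of this kind is simple to write down. The sketch below is illustrative only: apart from EXCH, the feature names (`F2`, `F3`) and every numeric value are hypothetical, and the board analysis that would count exchange squares is left out.

```python
def exch(current_squares, previous_squares):
    """EXCH as described above: exchange-forcing squares available to the
    player to move, minus the corresponding count for the opponent's
    previous move. Both counts are supplied by the caller here."""
    return current_squares - previous_squares


def evaluate(features, weights):
    """Weighted sum of feature values: value = sum_i w_i * f_i."""
    return sum(weights[name] * value for name, value in features.items())


# Hypothetical position: three features instead of Samuel's 16.
features = {"EXCH": exch(3, 1), "F2": -1, "F3": 0}
weights = {"EXCH": 4, "F2": 2, "F3": 1}
value = evaluate(features, weights)
print(value)  # 4*2 + 2*(-1) + 1*0 = 6
```

Because the terms are simply added, each feature's contribution can be read off independently, a property the weight-learning procedure relies on.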

Samuel’s program faced two tasks in attempting to learn such a polynomial
evaluation function: (a) discovering which features to use in the function
and (b) developing appropriate weights for combining the various features
to obtain a value for the board position. We describe the weight-learning task
first and later return to the problem of discovering which features to use.

In the self-play mode of training, the checkers program learns by playing
a copy of itself. The version of the program that is doing the learning is
referred to as Alpha, while the copy that serves as an opponent is called
Beta. The learning procedure employed by Alpha is to compare at each turn
its estimate of the value for the current board position with a performance
standard that provides a more accurate estimate of that value. The difference
between these two estimates controls the adjustment of the weights in the
evaluation function. Alpha’s estimate is developed by conducting a shallow
minimax search applying the evaluation polynomial to tip board positions
and backing up these values (see Article II.C5a, in Vol. I). The performance
standard is obtained by conducting a deeper minimax search into future board
positions using the same evaluation function as in the shallow search. Samuel
takes advantage of the fact that a deep search is usually more accurate than
a shallow one.
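The comparison between a shallow and a deep search can be illustrated on a toy game tree. This sketch uses plain minimax without alpha-beta pruning, and the tree, its static values, and the search depths are all invented for the example.

```python
def minimax(node, depth, maximizing=True):
    """Back up static values from positions `depth` plies below `node`."""
    children = node.get("children", [])
    if depth == 0 or not children:
        return node["static"]  # apply the evaluation function at the tip
    values = [minimax(child, depth - 1, not maximizing) for child in children]
    return max(values) if maximizing else min(values)


# Hypothetical tree: "static" is the evaluation polynomial's value there.
tree = {
    "static": 0,
    "children": [
        {"static": 5, "children": [{"static": 1}, {"static": 2}]},
        {"static": 3, "children": [{"static": 4}, {"static": 6}]},
    ],
}

alpha_estimate = minimax(tree, depth=1)        # shallow search
performance_standard = minimax(tree, depth=2)  # deeper search, same function
difference = performance_standard - alpha_estimate
print(alpha_estimate, performance_standard, difference)
```

In this tree the shallow search is misled by an optimistic static value that the deeper search corrects, so the difference is negative, meaning the shallow estimate was too high.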

How does Alpha use this move-by-move performance standard to guide
its search for proper weighting coefficients? First, the difference, $\Delta$, between
the performance standard and Alpha’s estimate is computed. If $\Delta$ is negative,
Alpha's polynomial is overestimating the value of the position. If $\Delta$ is positive,
Alpha is underestimating it. For each board feature, a count is kept of the
times that the sign of that feature agrees or disagrees with the sign of $\Delta$. From
these tallies, a correlation coefficient is developed that indicates the degree
to which that feature predicts $\Delta$. The goal of the learning procedure is to
minimize $\Delta$ (so that Alpha is duplicating the evaluations of the performance
standard). The weights of the polynomial are determined by scaling the
correlation coefficients onto the range $-2^{18}$ to $2^{18}$. Large positive coefficients
are given to features that strongly predict positive values of $\Delta$ and vice versa,
so that the polynomial will tend to “follow” $\Delta$ and thus reduce it.
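The tallying scheme can be sketched as follows. Several details are assumptions rather than Samuel's procedure: the correlation is taken as (agreements minus disagreements) divided by their total, the scaling onto the weight range is linear, the default bound of 2**18 is a guess at the range, and the feature names other than EXCH are hypothetical.

```python
def sign(x):
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def correlation_weights(records, max_weight=2 ** 18):
    """Derive one weight per feature from sign agreement with the error.

    `records` is an iterable of (features, delta) pairs, where `features`
    maps feature names to values for one move and `delta` is the
    performance standard minus the program's own estimate.
    """
    agree, disagree = {}, {}
    for features, delta in records:
        for name, value in features.items():
            agree.setdefault(name, 0)
            disagree.setdefault(name, 0)
            product = sign(value) * sign(delta)
            if product > 0:
                agree[name] += 1
            elif product < 0:
                disagree[name] += 1
    weights = {}
    for name in agree:
        total = agree[name] + disagree[name]
        correlation = (agree[name] - disagree[name]) / total if total else 0.0
        weights[name] = round(correlation * max_weight)  # assumed linear scaling
    return weights


# Hypothetical training records from three moves.
records = [
    ({"EXCH": 1, "F2": -2}, 3),
    ({"EXCH": 2, "F2": 1}, 1),
    ({"EXCH": -1, "F2": 3}, -2),
]
learned = correlation_weights(records)
print(learned)
```

A feature whose sign always matches the sign of the error receives the largest positive weight, and a feature that usually disagrees receives a negative one.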

The overall effect of this scheme is to independently assign blame for
Alpha’s estimation errors to the individual features. This is sensible, since
the features are combined independently (i.e., by addition, without any
interaction terms) to form the polynomial.

Alpha can be viewed as conducting a hill-climbing search through the
“rule space,” that is, the space of possible weights. Each move in the checkers
game serves as a training instance to guide this search. The correlation
coefficients summarize the entire body of training instances and indicate in
which direction the search must move in order to minimize $\Delta$.

Hill-climbing is known to have many drawbacks, including convergence
to local maxima. Samuel addresses this problem as follows. When Alpha and
Beta commence play, they are identical. However, while Alpha proceeds to
search the rule space, Beta does not change. As Alpha improves, it begins to
defeat Beta regularly. When Alpha has won a majority of the games played,
Beta adopts Alpha’s improved evaluation function, and the count of games
won and lost is started again from zero. Beta is thus used to “remember” a
good point in the rule space. If Alpha is at a local maximum, however, its
performance will tend to worsen whenever it makes a minor modification to its
polynomial. To prevent a local maximum from halting Alpha's improvement,
an arbitrary change is made to Alpha's scoring polynomial whenever Alpha
loses three games to Beta. The largest weight in Alpha’s polynomial is set at
zero to jump Alpha to some new point in the rule space.
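The bookkeeping between Alpha and Beta can be sketched as a small class. The text leaves several points open, so the following are assumptions: a majority is only checked after `min_games` games, "largest weight" is read as largest in magnitude, and both counters restart after either event. The class and method names are hypothetical.

```python
class SelfPlayTrainer:
    """Tracks game results between a learning Alpha and a fixed Beta."""

    def __init__(self, weights, min_games=3, loss_limit=3):
        self.alpha = dict(weights)  # weights being adjusted
        self.beta = dict(weights)   # remembered good point in rule space
        self.min_games = min_games
        self.loss_limit = loss_limit
        self.wins = 0
        self.losses = 0

    def record_game(self, alpha_won):
        """Update the counts after one game and react as described above."""
        if alpha_won:
            self.wins += 1
        else:
            self.losses += 1
        if self.losses >= self.loss_limit:
            # Jump to a new point: zero the weight of largest magnitude.
            largest = max(self.alpha, key=lambda name: abs(self.alpha[name]))
            self.alpha[largest] = 0
            self.wins = self.losses = 0
            return "perturbed"
        if self.wins + self.losses >= self.min_games and self.wins > self.losses:
            self.beta = dict(self.alpha)  # Beta adopts Alpha's function
            self.wins = self.losses = 0
            return "beta_updated"
        return "continue"


trainer = SelfPlayTrainer({"EXCH": 4, "F2": -9})
trainer.alpha["EXCH"] = 6  # pretend weight learning changed Alpha
results = [trainer.record_game(won) for won in (True, True, True)]
results += [trainer.record_game(won) for won in (False, False, False)]
print(results, trainer.alpha, trainer.beta)
```

Beta's copy is left untouched by the perturbation, so the previously remembered weights remain available while Alpha explores from its new starting point.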

Now that we have seen how Samuel’s program determines the weights
for the evaluation polynomial, we turn our attention to the first learning
problem: determining what features should be used to evaluate a board
position. This is a variant of the problem of new terms (see Article XIV.D1): How
can a learning program discover the appropriate terms for representing its
acquired knowledge? Samuel offers a partial solution to this problem, namely,
term selection. The learning program is provided with a list of 38 possible
terms. Its learning task is to select a subset of 16 of these terms to include in
the evaluation polynomial.

The selection process is quite straightforward. The program starts with
a random sample of 16 features. For each feature in the polynomial, a count
is kept of how many times that feature has had the lowest weight (i.e., the
weight nearest zero). This count is incremented after each move by Alpha.
When the count for some feature exceeds 32, that feature is removed from the
polynomial and replaced by a new term. At all times, 16 features are included
in the polynomial, and the remaining 22 features form a reserve queue. New
features are selected from the top of the queue, while features removed from
the polynomial are placed at the end of the queue. Viewed in the context of
credit assignment, Samuel’s program assigns blame to features whose weights
have values near zero, since those features are making no contribution to the
evaluation function.
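
A minimal Python sketch of this bookkeeping is given below. The names (`make_selector`, `after_move`), the use of the first 16 terms instead of a random sample, and the weight given to an incoming term are all assumptions made for illustration; only the counts, the threshold of 32, and the queue discipline come from the description above.

```python
from collections import deque

THRESHOLD = 32  # a term is replaced once its count exceeds this value

def make_selector(all_terms, active_size=16):
    """Split the supplied terms into an active set and a reserve queue."""
    active = list(all_terms[:active_size])    # assumption: not randomized here
    reserve = deque(all_terms[active_size:])
    counts = {name: 0 for name in active}
    return active, reserve, counts

def after_move(weights, active, reserve, counts):
    """Credit the lowest-weight term after one move; swap it out if stale."""
    lowest = min(active, key=lambda name: abs(weights[name]))
    counts[lowest] += 1
    if counts[lowest] > THRESHOLD:
        active.remove(lowest)
        reserve.append(lowest)        # removed term goes to the end of the queue
        newcomer = reserve.popleft()  # new term comes from the top of the queue
        active.append(newcomer)
        del counts[lowest]
        counts[newcomer] = 0
        weights.setdefault(newcomer, 0.0)  # assumption: unseen terms start at 0
    return lowest
```

With 38 terms, the active list always holds 16 names and the reserve queue 22, matching the figures in the text.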

Samuel (1959) was dissatisfied with this term-selection approach to the
new-term  problem.  He  writes: 

It  might  be  argued  that  this  procedure  of  having  the  program  select  new 
terms  for  the  evaluation  polynomial  from  a  supplied  list  is  much  too  simple 
and  that  the  program  should  generate  terms  for  itself.  Unfortunately,  no 
satisfactory scheme for doing this has yet been devised. (p. 220)

The  feature-selection  and  weight-adjustment  learning  processes  take  place 
concurrently.  In  Samuel’s  experiment  with  these  learning  methods,  the  set  of 
selected  features  and  their  weights  started  to  stabilize  after  roughly  32  games 
of  self-play.  The  resulting  program  was  able  to  play  a  “better-than-average” 
game  of  checkers  (Samuel,  1959,  p.  222). 

Learning  a  Signature  Table  by  Book  Training 

The second kind of static evaluation function investigated by Samuel was
a system of signature tables. A signature table is an n-dimensional array. Each
dimension of the array corresponds to one of the measured board features.
To obtain the estimated value of a board position, we measure each of the
board features and index these values into the signature-table array. The
content of each cell in the table is a number that gives the value of the
corresponding board position. In a sense, the signature table maps all possible
board positions into a small n-dimensional feature space. Every point in that
feature space is represented as a cell in the signature table that gives the value
of all board positions mapped to that point.

Suppose,  for  example,  that  we  had  only  three  features:  KCENT  (king 
center  control),  MOB  (total  mobility),  and  GUARD  (back-row  control).  The 
cube  shown  in  Figure  D5a-1  is  a  schematic  diagram  of  the  resulting  signature 
table.  Notice  that  KCENT  and  GUARD  take  on  only  the  values  —1,  0,  and  1, 
while  MOB  is  allowed  to  take  on  values  from  —2  to  +2.  If  we  have  a  board 
position  for  which  KCENT  =  1,  GUARD  =  0,  and  MOB  =  2,  then  we  look  into 
the  signature  table  at  the  cell  addressed  by  (1,0,2)  to  obtain  the  value:  .8. 

It is possible to view this signature table as a set of 3 × 3 × 5 = 45
production rules. There is one rule for every possible combination of
features (every cell) in the table. The rule for the situation illustrated in
Figure D5a-1 could be stated as

If:    KCENT = 1 ∧ GUARD = 0 ∧ MOB = 2,
Then:  Value of position = .8.
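
The three-feature table can be mocked up in a few lines of Python. This is a sketch only: the dictionary representation and the default cell value of 0.0 are assumptions, and the single filled cell is the one used in the example above.

```python
from itertools import product

KCENT_VALUES = (-1, 0, 1)
GUARD_VALUES = (-1, 0, 1)
MOB_VALUES = (-2, -1, 0, 1, 2)

# One cell per combination of (KCENT, GUARD, MOB); cells start at 0.0.
table = {cell: 0.0 for cell in product(KCENT_VALUES, GUARD_VALUES, MOB_VALUES)}
table[(1, 0, 2)] = 0.8  # the cell discussed in the text

def evaluate(kcent, guard, mob):
    """Look up the value of any board position with these feature values."""
    return table[(kcent, guard, mob)]
```

Each key of `table` plays the role of the condition part of one production rule, and the stored number plays the role of its conclusion.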

Signature tables are more expressive than linear polynomials because they
can capture interactions among all of the features. Their main drawbacks,
however, are their large size and related problems with learnability. A full
signature table for the entire set of 24 terms used by Samuel would contain
roughly 6 × 10^11 cells, far too large to be stored or effectively learned. Two
techniques were applied to overcome these problems. First, the number of
possible values for each feature was substantially reduced. Most features were
restricted to three values: +1 (if the position is good for the program), 0 (if
the position is even), and -1 (if the position is bad for the program). Second,
instead of one giant signature table, Samuel adopted the three-level hierarchy
shown in Figure D5a-2.

Figure D5a-1. A three-dimensional signature table.

Figure D5a-2. Three-level hierarchy of signature tables
(from Samuel, 1967).

The 24 board features are partitioned into six important subgroups, and
a separate signature table is developed for each group. The outputs of the
six first-level signature tables are values between -2 and +2 that are used as
indexes to two second-level signature tables. The second-level tables produce
values between -7 and +7 that are used as indexes to the final signature
table to obtain the estimated value of the board position. This hierarchical
system was found to be expressive enough to support excellent checkers play
and small enough to be learnable.
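
The flow of values through the hierarchy can be sketched as below. The sketch assumes that each second-level table is indexed by three first-level outputs and that unseen cells read as 0; neither detail is stated in the text, and the function and argument names are hypothetical.

```python
def lookup(table, key):
    """Read one cell; unseen cells default to 0 (an assumption)."""
    return table.get(key, 0)

def evaluate(features, first_level, second_level, final_table):
    """Evaluate a position through a three-level signature-table hierarchy.

    features:     dict mapping feature name -> measured value
    first_level:  six (feature_names, table) pairs
    second_level: two tables, each indexed by three first-level outputs
    final_table:  one table indexed by the two second-level outputs
    """
    firsts = [lookup(table, tuple(features[n] for n in names))
              for names, table in first_level]
    seconds = (lookup(second_level[0], tuple(firsts[:3])),
               lookup(second_level[1], tuple(firsts[3:])))
    return lookup(final_table, seconds)
```

Because each table sees only a few indexes, the total number of cells grows with the sum of the table sizes instead of their product.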

The program learns the values for the cells in these tables by following
“book games” played between two master checkers players. Approximately
250,000 board situations of master play were presented to the program. Most
of these moves were selected from games ending in a draw. The program
operates as follows. Each cell in the signature table is associated with two
counts, called A (agree) and D (differ). Initially, A and D are zero for each
cell. At each move, the program is faced with a set of alternative moves, one
of which is the book-designated move. Each of these possible moves can be
mapped into one cell in each signature table. The program adds a one to the
D count of each cell whose corresponding move was not the book-preferred
move. A total of n (where n is the number of nonbook moves) is added to the
A count of each cell corresponding to the book-preferred move. Periodically,
the contents of the signature-table cells themselves are updated to reflect the
A and D counts. Each cell is given the value

    (A - D) / (A + D),

which is a rough correlation coefficient indicating the extent to which the
board positions mapped to that cell are the book-preferred moves. The
correlation coefficients are then scaled into the -2 to +2 (or -7 to +7) range.
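
The counting scheme and the coefficient (A - D) / (A + D) can be illustrated with the following Python sketch. The class name, the use of arbitrary hashable cell labels, and the value 0.0 for untrained cells are assumptions; the scaling step is omitted because the text does not say how it is carried out.

```python
from collections import defaultdict

class CellCounts:
    """Agree (A) and differ (D) counts for the cells of one signature table."""

    def __init__(self):
        self.agree = defaultdict(int)
        self.differ = defaultdict(int)

    def record_move(self, book_cell, other_cells):
        """Update counts for one book move and its rejected alternatives."""
        n = len(other_cells)           # number of nonbook moves
        self.agree[book_cell] += n
        for cell in other_cells:
            self.differ[cell] += 1

    def coefficient(self, cell):
        """Return (A - D) / (A + D) for one cell."""
        a = self.agree.get(cell, 0)
        d = self.differ.get(cell, 0)
        if a + d == 0:
            return 0.0                 # assumption: untrained cells read as 0
        return (a - d) / (a + d)
```

Adding n to the A count balances the n single increments spread over the D counts, so a cell that is chosen and rejected equally often ends up near zero.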

This  learning  process  can  be  viewed  as  a  technique  of  learning  from 
examples.  Each  move  provides  a  training  instance  that  is  used  to  update 
several signature-table entries. Credit assignment is easy, because the book
provides  a  fairly  reliable  performance  standard  on  a  move-by-move  basis. 
Credit  is  assigned  to  the  signature-table  cell  corresponding  to  the  book  move, 
and  blame  is  allotted  to  all  cells  corresponding  to  rejected  alternative  moves. 
It  is  the  learning-by-doing  approach  that  allows  the  program  to  determine 
which moves are the alternative moves.

The  second-  and  third-level  tables  are  trained  at  the  same  time,  and  by 
the  same  techniques,  as  the  first-level  tables.  The  current  contents  of  the 
signature  tables  are  used  to  determine  which  second-  and  third-level  cells 
correspond  to  the  alternative  moves  under  consideration,  and  their  A  and  D 
totals  are  updated  during  each  move.  The  learning  process  is  quite  erratic 
at the start, since most of the first-level signature-table cells contain zeros
initially. Thus, incorrect second- and third-level cells are selected during the
early  stages  of  learning.  As  learning  progresses,  these  errors  are  overcome. 

To  make  the  tables  more  reliable  during  the  early  stages  of  training, 
some smoothing is done to fill in cells for which the A and D counts are still
near zero. Smoothing is a form of generalization involving interpolating and
extrapolating from surrounding cells in the table. The smoothing has no effect
on the A and D counts; these are used later to replace the interpolated values
with  more  accurate,  induced  values. 

One other refinement of the signature-table system is to break the game
of  checkers  into  seven  chronological  phases  and  to  use  a  different  signature 
table  for  each  phase.  Samuel  reasoned  that  the  board  features  relevant  to 
determining  good  moves  during  the  opening  of  the  game  are  unlikely  to  be  the 
same  as  those  used  during  the  ends  of  games.  The  seven-phase  approach  leads 
to  an  increase  in  the  number  of  cells,  thus  making  the  tables  more  difficult  to 
learn.  However,  Samuel  was  able  to  fill  in  empty  cells  by  smoothing  from  the 
tables  of  adjacent  phases. 

Results

Samuel’s  signature-table  system  was  much  more  effective  as  a  checkers 
player  than  any  of  the  other  configurations  he  tested.  To  assess  the  goodness  of 
play,  Samuel  tested  the  program  on  895  book  moves  that  were  not  used  during 
the  training.  A  count  was  made  of  the  number  of  times  that  the  program 
rated  0,  1,  2,  etc.,  moves  as  equal  to  or  better  than  the  book-recommended 
move.  After  training  on  173,989  book  moves,  the  test  gave  the  results  shown 
in  Table  D5a-1.  By  summing  the  first  two  columns,  we  see  that  the  program 
chooses the best move or the second-best move, as defined by the book,
64%  of  the  time.  These  ratings  are  made  without  employing  any  forward 
search.  Minimax  look-ahead  search  improves  the  performance  of  the  program 
substantially. 

Despite this impressive level of performance, champion checkers players
are still able to beat the program. In 1965, the world champion, W. F. Hellman,
won all four correspondence games played against the program. He drew with
the  program  during  one  “hurriedly  played  cross-board  game”  (Samuel,  1967, 
p.  601,  n.  2). 

Comparison of the Signature-table and Polynomial Methods

The  signature-table  method  substantially  outperformed  the  polynomial- 
evaluation-function  approach.  Even  when  both  methods  were  trained  by 
following  book  moves,  the  moves  chosen  by  the  polynomial  evaluation  function 
correlated  with  the  book-indicated  moves  only  half  as  well  as  the  moves  chosen 
by  the  signature  tables.  This  difference  is  due  to  the  improved  representational 
power  of  the  signature  tables.  The  signature  table  can  represent  nonlinear 
relationships  among  the  various  terms,  since  there  is  a  different  table  cell 
for  each  possible  combination  of  terms.  In  the  polynomial  representation, 
only  linear  relationships  are  possible.  Such  a  representation  assumes  that 
each  term  contributes  independently  to  the  value  of  a  board  position.  This 
assumption  is  evidently  incorrect  for  checkers. 

Conclusion 

Samuel developed and tested several different representations and training
techniques for teaching a program to play checkers. Among the contributions
of this work are (a) the demonstration that machine-learning techniques can
be highly successful, (b) the technique of using a deeper search and
book-supplied moves to solve the credit-assignment problem, (c) the term-selection
methods for determining which features to include in the polynomial evaluation
function, and (d) the demonstration that signature tables provide a much
more effective representation for checkers knowledge than either the
linear-polynomial or the rote-learning techniques.

TABLE D5a-1

Evaluation of Signature-table Performance

Number of moves rated as better
than or equal to book move        0    1    2    3    4    5    6

Relative proportion              38%  26%  16%  10%   6%   3%   1%

References

All of this work is discussed in Samuel (1959, 1967). See Buchanan,
Mitchell, Smith, and Johnson (1977) for a discussion of Samuel's term-selection
technique as an instance of a layered learning system.


D5b.  Waterman’s  Poker  Player 


AS PART of his thesis project, Donald Waterman (1968) developed a computer
program that learns to play draw poker. Draw poker is a game of imperfect
information in which psychological factors, such as how easily one’s opponent
is bluffed, become important. Minimax look-ahead search is not possible
because the overall state of the game (i.e., the contents of all the hands)
is not completely known. Instead, approximate heuristic methods must be
used. Waterman developed a production system (see Article III.C4, in Vol. I) to
encode a set of heuristics for poker, and he sought to have his program discover
these production rules through experience. In this article, we first describe
Waterman's production-rule knowledge representation and its application in
the poker-playing performance element; we then discuss in detail the methods
used in the learning element to acquire and refine these production rules.

Waterman's Performance Element for Draw Poker

Each game of draw poker is divided into five stages. First, each player
is dealt five cards. This is followed by a betting stage in which the players
alternately choose to place a bet larger than the opponent's bet (RAISE), place
a bet equal to the opponent's bet (CALL), or give up (DROP) the hand; a CALL
or DROP action ends this stage. In the third stage, each player has the option
of replacing up to three of his (or her) cards with new cards drawn from the
deck. This is followed by another betting stage like the first. Finally, the
hands are compared (except in a DROP), and the player with the best hand
wins the game.

Waterman’s performance element has built-in routines for carrying out
the deal, the draw, and the final comparison of hands. The two betting
stages, however, are performed by a modifiable production system. It is the
production  rules  making  up  this  production  system  that  the  program  attempts 
to  learn  and  improve. 

The  production  system  developed  by  Waterman  contains  two  basic  kinds 
of rules: interpretation rules that compute important features of the game
situation  and  action  rules  that  decide  which  action  (CALL,  DROP,  or  RAISE) 
to  take. 

The  action  rules  make  their  decisions  based  on  the  values  of  seven  key 
variables  that  make  up  the  so-called  dynamic  state  vector: 

(VDHAND, POT, LASTBET, BLUFFO, POTBET, ORP, OSTYLE).

VDHAND, for example, is a measure of the value of the program’s hand, POT is
the current amount of money in the pot, and BLUFFO is an estimate of the
opponent’s “bluffability.”

The interpretation rules compute the values of these seven variables from
directly observable quantities. To compute the value of BLUFFO, for example,
features such as OBLUFFS (the number of times the opponent has been caught
bluffing) and OCORREL (the correlation between the opponent’s hands and his
bets) are examined. Once numeric values for the seven variables have been
computed, they are converted into symbolic values that describe important
subranges of values. For example, the rule

If POT > 50, then POT = BIGPOT

gives POT the symbolic value BIGPOT whenever POT is larger than 50.
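
A rule of this kind amounts to a threshold test on a numeric value. The sketch below shows one way to express it in Python; the list-of-pairs representation, the function name, and the fallback symbol `SMALLPOT` are hypothetical and are not taken from Waterman's program.

```python
def interpret(value, rules, default="SMALLPOT"):
    """Map a numeric value to a symbolic one.

    rules is an ordered list of (threshold, symbol) pairs, each read as
    "if value > threshold, then the symbolic value is symbol".
    The first rule that applies wins.
    """
    for threshold, symbol in rules:
        if value > threshold:
            return symbol
    return default  # hypothetical symbol for values no rule covers

POT_RULES = [(50, "BIGPOT")]
```

For example, `interpret(60, POT_RULES)` yields `BIGPOT`, while a pot of exactly 50 falls through to the default.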

The action rules are stated solely in terms of these symbolic values. A
typical action rule is

(SUREWIN, BIGPOT, POSITIVEBET, *, *, *, *)
    => (*, POT + (2 × LASTBET), 0, *, *, *, *) CALL,

which can be paraphrased as

If:    VDHAND = SUREWIN
       and POT = BIGPOT
       and LASTBET = POSITIVEBET,
Then:  POT := POT + (2 × LASTBET)
       LASTBET := 0
       CALL.

The  condition  and  action  parts  of  the  rule  have  the  same  form  as  the  state 
vector.  The  left-hand  side  of  the  rule  is  a  pattern  that  is  matched  against 
the  state  vector  to  determine  whether  the  rule  should  be  executed.  The  right- 
hand  side  of  the  rule  indicates  which  action  to  take  and  provides  instructions 
for  modifying  the  value  of  the  state  vector. 

These production rules are applied by the performance element as follows.
First, all of the interpretation rules are used to analyze the current game
situation in order to develop the dynamic state vector. Next, the action
rules are examined one by one in a fixed order until a rule is found whose
condition pattern matches the state vector. That rule is executed to make
the program's move. This fixed ordering for the production rules serves as
a conflict-resolution technique (see Article III.C4, in Vol. I). If more than one
rule is applicable in a given situation, only the first rule in the list is executed.
Hence, when new rules are acquired or old rules are modified, the order of the
rules must be carefully considered.
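
The matching cycle can be sketched in Python as follows. Rules are represented here as (pattern, action) pairs and the wildcard as the string "*"; both choices, and the function names, are assumptions made for illustration.

```python
WILD = "*"

def matches(pattern, state):
    """True if every non-wildcard entry of the pattern equals the state entry."""
    return all(p == WILD or p == s for p, s in zip(pattern, state))

def first_applicable(rules, state):
    """Scan the ordered rule list and return the first (pattern, action) match.

    Returns None when no rule applies. Taking the first match is the
    conflict-resolution strategy: later applicable rules are ignored.
    """
    for pattern, action in rules:
        if matches(pattern, state):
            return pattern, action
    return None
```

Swapping two rules that both match a state changes the action taken, which is why rule order matters when rules are added or modified.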

There are two basic ways to generalize the left-hand side of an action rule.
One method is to drop a condition by replacing one of the symbolic values
on the left-hand side (e.g., BIGPOT) by *, which matches any value. The other
method is to modify the interpretation rule that defines a symbolic value so
that it includes a larger set of underlying numeric values (e.g., changing BIGPOT
to be any POT > 43). This is the same as Michalski’s method of generalizing
by internal disjunction (see Article XIV.D1). We will see below how Waterman
makes use of these two generalization methods.
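
Both generalization methods are small edits to data structures, as the following Python sketch suggests. The tuple patterns, the (threshold, symbol) rule format, and the function names are assumptions for illustration only.

```python
def drop_condition(pattern, index):
    """Generalize a pattern by replacing one condition with the wildcard *."""
    return pattern[:index] + ("*",) + pattern[index + 1:]

def widen_symbol(rules, symbol, new_threshold):
    """Generalize a symbolic value by lowering its 'value > threshold' bound.

    rules is a list of (threshold, symbol) pairs; only the named symbol is
    changed, and its threshold is never raised.
    """
    return [(min(t, new_threshold), s) if s == symbol else (t, s)
            for t, s in rules]
```

For instance, `widen_symbol([(50, "BIGPOT")], "BIGPOT", 43)` gives `[(43, "BIGPOT")]`, so that every POT above 43 now counts as BIGPOT.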

Learning  to  Play  Poker 

Waterman sought to have the program learn the interpretation rules, the
action rules, and the ordering of the action rules by playing poker games
against an expert opponent. As the poker games proceed, the learning element
analyzes each of the decisions of the performance element and extracts
training instances. Each training instance is in the form of a training rule, that is,
a specific production rule that would have made the correct decision had it
been chosen and executed. The training rules guide the learning element as
it determines which production rules to generalize and specialize.

The task of extracting a training rule is quite difficult, because the
environment provides very little information that could serve as a performance
standard. Unlike deterministic games such as checkers or chess that have
no chance element, poker is probabilistic. Even an expert player will lose
from time to time. Thus, the program must play several hands before it can
assess the quality of the production rules in its knowledge base. As discussed
in the introduction to this section (Article XIV.D5), however, even when a
reliable performance standard is available on a full-game basis, the problem
of assigning credit or blame to individual moves in that game is still very
difficult. Consequently, Waterman sought to provide the program with some
form of move-by-move performance standard. Three different techniques were
developed: advice-taking, automatic training, and analytic training.

In advice-taking, the program plays a series of poker games against a
human expert. After each turn by the performance element, the learning
element asks the expert whether the performance-element action is correct. The
expert responds either with (OK) or with some advice such as (CALL BECAUSE
YOUR HAND IS FAIR, THE POT IS LARGE, AND THE LASTBET IS LARGE). This
advice provides the training rule directly.

In  the  automatic-training  approach,  an  expert  program  serves  as  the 
opponent and advice-giver. The expert program uses a knowledge base of
production  rules  developed  by  Waterman  himself  to  determine,  at  each  move, 
what  action  to  take.  During  play  against  the  learning  program,  the  expert 
program  compares  each  move  made  by  the  learning  program  with  the  move 
it  would  have  made  and  provides  advice  exactly  as  a  human  expert  would. 

Finally, the most interesting method of instruction, the analytic method,
involves no advice-taking whatsoever. After each full round of play (i.e., each
single hand), the learning element analyzes the moves made by the
performance element and attempts to deduce which moves were incorrect. In
place of an externally supplied performance standard, the learning element is
provided with a predicate-calculus axiomatization of the rules of poker. From
these axioms, the program is able to deduce, after the hand is over, what the
correct decisions would have been, thus providing the learning element with
a performance standard.

Once the learning element has a move-by-move performance standard, it
can extract a training rule and modify the production system. The
modification process works by first locating the production rule that made the incorrect
decision and then examining the list of production rules for a rule before or
after the error-causing rule that could have made the correct decision. If
such a rule is found, generalization and specialization techniques are applied
to modify the production rules so that the proper rule would have been
executed. If no such rule is found, the training rule itself is inserted into the
production-rule list immediately in front of the error-causing rule.
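
The overall control flow of this modification step might be sketched as below. The sketch covers only the search for candidate rules and the fallback insertion; the generalization and specialization of a candidate are left out, and the (pattern, action) representation, the function name, and the return convention are all assumptions.

```python
def integrate(rules, error_index, training_rule):
    """Decide how a training rule should change an ordered rule list.

    rules is a list of (pattern, action) pairs and is modified in place.
    Returns ("revise", indexes) when other rules with the desired action
    exist, or ("inserted", index) when the training rule was added.
    """
    _, wanted_action = training_rule
    candidates = [i for i, (_, action) in enumerate(rules)
                  if i != error_index and action == wanted_action]
    if candidates:
        # Generalizing or specializing these rules is not sketched here.
        return "revise", candidates
    # No usable rule: place the training rule just ahead of the faulty one.
    rules.insert(error_index, training_rule)
    return "inserted", error_index
```

Inserting ahead of the error-causing rule ensures that, under first-match conflict resolution, the new rule is tried before the rule that made the mistake.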

In  the  remainder  of  this  article,  we  discuss  how  each  of  these  three  training 
techniques  allows  the  learning  element  to  develop  a  training  rule.  For  the 
advice-taking  and  automatic-training  methods,  this  is  straightforward.  In  the 
analytic  approach,  however,  a  series  of  credit-assignment  problems  must  be 
solved. We describe Waterman's solutions in detail. Finally, we describe how
the  training  rule  acquired  by  any  one  of  these  methods  is  used  to  modify  the 
current  set  of  production  rules  in  the  knowledge  base. 

Advice-taking  and  Automatic  Training 

In the advice-taking and automatic-training methods, the program is
supplied after each move with advice such as:

(CALL BECAUSE YOUR HAND IS FAIR, THE POT IS LARGE,
 AND THE LASTBET IS LARGE).

This advice provides the training rule directly. The proper action (i.e., the
right-hand side of the training rule), CALL, is indicated along with the relevant
variables and their values. This advice is equivalent to the production rule:

(FAIR, LARGE, LARGE, *, *, *, *)
    => (*, POT + (2 × LASTBET), 0, *, *, *, *) CALL.

The  details  of  the  right-hand  side  of  the  rule  can  be  filled  in  automatically 
for  each  action  from  knowledge  of  the  rules  of  the  game.  In  this  case,  for 
example,  CALL  requires  the  program  to  match  its  opponent’s  bet,  and  thus  the 
POT  must  increase  by  twice  LASTBET,  once  for  the  opponent’s  bet  and  again 
for the program's reply. The other possibilities, DROP and RAISE, are handled
similarly. 
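
The translation from advice to a training rule can be sketched as follows. Parsing of the English advice is skipped: the advice is assumed to arrive as an action plus a dictionary of variable values. Only CALL is filled in, since the text does not spell out the right-hand sides for DROP and RAISE; all function names are hypothetical.

```python
VARIABLES = ("VDHAND", "POT", "LASTBET", "BLUFFO", "POTBET", "ORP", "OSTYLE")

def training_rule_from_advice(action, reasons):
    """Build (left-hand side, right-hand side, action) from one piece of advice.

    reasons maps variable names to symbolic values, e.g. {"POT": "LARGE"}.
    Variables the advice does not mention become the wildcard *.
    """
    lhs = tuple(reasons.get(name, "*") for name in VARIABLES)
    if action != "CALL":
        raise ValueError("only CALL is sketched here")
    rhs = ("*", "POT + (2 * LASTBET)", "0", "*", "*", "*", "*")
    return lhs, rhs, action

def apply_call(pot, lastbet):
    """Numeric effect of CALL: both bets enter the pot, LASTBET resets."""
    return pot + 2 * lastbet, 0
```

With a pot of 40 and a last bet of 5, `apply_call` returns a pot of 50 and a last bet of 0.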

It is interesting to note that Waterman’s program accepts fairly low-level
advice. The expert's advice can easily be interpreted in terms of the present
game situation, so there is no need to interpret or operationalize the advice
(see Article XIV.C1). Waterman’s advice-taking research concentrates, instead,
on the problem of integrating this advice into the current knowledge base.
We describe how this happens after we discuss the methods employed during
analytic training to obtain the training rule.

Learning  by  the  Analytic  Technique 

The main difficulty facing Waterman's program during analytic training
is credit assignment. The learning element has to deal with a pair of
credit-assignment problems. The first problem is determining the quality of a round
of play. As we mentioned above, the probabilistic nature of draw poker makes
this difficult, since the loss of a single hand does not necessarily indicate that
the program is playing poorly. Furthermore, the fact that poker is a game
of imperfect information leads to difficulties. If, for example, the program
“drops” its bid (i.e., folds its hand and gives in to the other player), the
contents of the opponent’s hand are never known. The program solves this
first credit-assignment problem by always “calling” the bid (i.e., meeting the
opponent’s bet and requesting to see his hand), instead of dropping, and by
applying its knowledge of the rules of poker to deduce whether the program
could have improved its play within the round.

If the program could have done better, it turns its attention to the second
credit-assignment problem: determining which individual moves were poor.
During the round of play, a complete trace of the actions of the performance
element is kept. To solve the second credit-assignment problem, the learning
element applies its axiomatization of the rules of poker to evaluate each move
in detail. The rules of poker are axiomatized in predicate calculus as a set of
implications such as:

ACTION(CALL) ∧ HIGHER(YOURHAND, OPPHAND)
    ⊃ ADD(LASTBET, POT) ∧ ADD(POT, YOURSCORE).

These statements define the effects of each of four possible actions: BET HIGH,
BET LOW, CALL, and DROP. To evaluate a particular move in the game, the
learning element takes the value of the dynamic state vector at that point and
uses it to determine the truth value of certain predicates in this axiom system
(e.g., GOOD(OPPHAND), HIGHER(OPPHAND, YOURHAND)). Then it tries to prove the
statement

MAXIMIZE(YOURSCORE)

by backward-chaining through the axiom system (see Article III.C4, in Vol. I).
The resulting proof indicates the action that should have been performed and
provides the move-by-move performance standard. When the performance
standard differs from the move made by the program, blame is assigned to
that move, and the learning element builds a training rule.

The correct decision, obtained from the performance standard, forms the
right-hand side (action part) of the training rule. Waterman axiomatized the
RAISE action as two possible subactions, BET HIGH and BET LOW, so that the
program would not have to learn how big a bet to make. For BET HIGH, the
performance element chooses a random bet between 10 and 20. Similarly, a
BET LOW action leads to a random bet between 1 and 9. Thus, the performance
standard provides the complete right-hand side of the training rule.
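
The bet-sizing step is simple enough to show directly. In this sketch the ranges are taken from the text, while the assumption that bets are whole numbers with both endpoints included, and the function name, are illustrative.

```python
import random

BET_RANGES = {"BET HIGH": (10, 20), "BET LOW": (1, 9)}

def choose_bet(subaction, rng=random):
    """Pick a random whole-number bet for a RAISE subaction."""
    low, high = BET_RANGES[subaction]
    return rng.randint(low, high)  # endpoints included (an assumption)
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible when experimenting.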

The left-hand side of the training rule is obtained by examining a table
called the decision matrix. The decision matrix contains four abstract rules,
one for each possible action. These rules tell which values of the seven
state variables are relevant for the indicated action. The exact values of the
variables are not given, only a general indication of whether the values should
be large or small. For instance, the abstract rule for the DROP action is

(CURRENT, LARGE, LARGE, SMALL, SMALL, CURRENT, LARGE) => DROP,

or more clearly,

If:    VDHAND = (current symbolic value of VDHAND)
       and POT = LARGE
       and LASTBET = LARGE
       and BLUFFO = SMALL
       and POTBET = SMALL
       and ORP = (current symbolic value of ORP)
       and OSTYLE = LARGE,
Then:  DROP.

Once the learning element has deduced from the axioms that the proper
action  would  have  been  DROP,  it  takes  the  corresponding  rule  from  the  decision 
matrix  and  uses  it  as  the  training  rule.  Notice  that  the  level  of  abstraction  of 
the  rules  in  the  decision  matrix  is  the  same  as  the  level  of  abstraction  of  the 
advice  supplied  by  the  human  expert  or  expert  program. 
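
Turning an abstract rule into a training rule means filling in the CURRENT slots from the state vector. A Python sketch follows; only the DROP row is known from the text, and the dictionary layout, the function name, and the symbolic values in the usage note are hypothetical.

```python
DECISION_MATRIX = {
    # Order: VDHAND, POT, LASTBET, BLUFFO, POTBET, ORP, OSTYLE
    "DROP": ("CURRENT", "LARGE", "LARGE", "SMALL", "SMALL", "CURRENT", "LARGE"),
}

def training_rule(action, state):
    """Instantiate the abstract rule for an action against a state vector.

    Each CURRENT entry is replaced by the symbolic value the state vector
    holds in that position; LARGE and SMALL entries are kept as they are.
    """
    abstract = DECISION_MATRIX[action]
    lhs = tuple(s if a == "CURRENT" else a for a, s in zip(abstract, state))
    return lhs, action
```

Given a state vector whose first and sixth entries are `H2` and `R1`, the DROP training rule becomes `(H2, LARGE, LARGE, SMALL, SMALL, R1, LARGE) => DROP`.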

It  could  be  argued  that  the  use  of  the  decision  matrix  is  improper,  since 
it  provides  the  learning  element  with  essential  information  that  a  person  who 
was  learning  to  play  poker  would  have  to  discover  himself.  Waterman  (1968) 
suggests  some  methods  by  which  the  decision  matrix  could  be  learned  from 
experience,  but  none  of  these  was  implemented. 

Using the Training Rule to Modify the Knowledge Base

Once the training rule is obtained, whether by advice from a person, by
advice from the expert program, or by analysis, it must be used to modify
the production rules in the knowledge base. The training rule is first used
to modify the interpretation rules. The left-hand side of the training rule is
compared with the state vector computed by the interpretation rules. LARGE
matches symbolic values that correspond to large values of the underlying
variable. Similarly, SMALL matches small values. If a symbol does not match,
the interpretation rules that computed that symbol are assigned blame. They
are then either modified or augmented to include a new interpretation rule.

Suppose, for example, that the state vector listed POT as having the value
P3, where P3 is derived by the interpretation rule:

    If POT > 20, then POT = P3.

Furthermore, suppose that the value of POT in the game situation being
analyzed is 45. By comparing P3 with LARGE, the learning element determines
that this interpretation rule is incorrect (since P3 can refer to very small
values of POT). The learning element can either modify the rule (by
substituting 44 for 20) or create a new rule. A user-supplied parameter, KK,
specifies the largest allowable change that can be made to a numeric value in
an interpretation rule. In this case, we will assume that the learning element
creates the new rule

    If POT > 44, then POT = P4,

and modifies the state vector so that POT has the value P4.
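
The choice between modifying a rule and creating a new one can be sketched
in Python. This is only an illustration under stated assumptions: the
function names, the representation of an interpretation rule as a
(threshold, symbol) pair, and the choice of new threshold (one less than the
observed value, which reproduces the 44 used above) are hypothetical and are
not taken from Waterman's program. Only the role of KK follows the text.

```python
def repair_threshold(rules, blamed_symbol, observed, kk, new_symbol):
    """Return a repaired copy of `rules`, a list of (threshold, symbol) pairs.

    A pair (t, s) stands for the rule "If POT > t, then POT = s".
    The blamed rule fired on `observed`, but the training rule required LARGE.
    """
    # Assumed choice: the largest integer threshold still below `observed`.
    target = observed - 1
    repaired = []
    for threshold, symbol in rules:
        if symbol != blamed_symbol:
            repaired.append((threshold, symbol))
        elif abs(target - threshold) <= kk:
            # The change is within KK: modify the blamed rule itself.
            repaired.append((target, symbol))
        else:
            # The change would exceed KK: keep the old rule, add a new one.
            repaired.append((threshold, symbol))
            repaired.append((target, new_symbol))
    return repaired


def interpret(rules, value):
    """Return the symbol of the matching rule with the highest threshold."""
    matching = [(t, s) for t, s in rules if value > t]
    return max(matching)[1] if matching else None
```

With a small KK the sketch adds the rule for P4, as in the example above;
with a KK of at least 24 it would instead raise the threshold of the P3 rule.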

Once the interpretation rules have been checked and modified, the updated
state vector is matched against the action rules to find the rule that
made the incorrect decision. This rule is called the error-causing rule. The
training rule is then used to locate a production rule that could have made
the correct decision had it been executed. This is accomplished by comparing
the right-hand side of the training rule with each production rule in the rule
base.

Waterman's program classifies action rules as either recently hypothesized
or accepted. A recently hypothesized rule is one that was recently added to the
knowledge base, whereas an accepted rule is one that the program believes to
be nearly correct. The learning element follows a strategy of first attempting
to make minor changes in accepted rules and then, if minor changes do not
suffice, attempting to make major changes in recently hypothesized rules.
Finally, if a suitable recently hypothesized rule cannot be found, the training
rule is added to the rule base and is labeled as recently hypothesized.

The learning element searches upward, ahead of the error-causing rule,
for an accepted rule that would have made the correct decision. If such a
rule is found, it is checked to see if the pattern of its left-hand side can be
generalized to match the current state vector. Only minor generalizations
(that is, changes to the interpretation rules) are considered. No conditions
are dropped (i.e., replaced by *).

If no accepted rule can be found, the learning element again searches
upward before the error-causing rule, this time looking for a recently
hypothesized rule that would have made the correct decision. If such a rule is
found, major changes (including both dropping conditions and modifying
interpretation rules) are made in the left-hand-side pattern so that it matches
the state vector.

If no suitable rules can be found before the error-causing rule, the learning
element searches for an accepted rule after the error-causing rule. If an
appropriate rule is found there, the error-causing rule and all intervening
rules must be specialized so that they will not match the state vector, and
the target rule must be generalized (by changing the interpretation rules) so
that it will match the state vector.

Finally, if no rules can be found that could be generalized to make the
correct decision, the training rule is inserted into the ordered list of production
rules immediately in front of the error-causing rule. The training rule is
marked as being recently hypothesized. Figure D5b-1 depicts this four-step
process of modifying the rule base.
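
The order in which the four steps are tried can be sketched as follows. This
is a simplified illustration, not Waterman's code: the dictionary
representation of rules, the status strings, and the two caller-supplied
predicates (which stand in for the actual pattern generalization) are
assumptions made for the example.

```python
def choose_repair(rules, error_index, correct_action, minor_ok, major_ok):
    """Pick one of the four repair steps for an ordered rule list.

    rules          -- list of dicts with keys "action" and "status"
    error_index    -- position of the error-causing rule
    correct_action -- right-hand side of the training rule
    minor_ok(rule) -- True if changing interpretation rules makes `rule` match
    major_ok(rule) -- True if dropping conditions as well makes `rule` match
    Returns (step_number, index); for step 4 the index is the insertion point.
    """
    def candidates(indices, status):
        return [i for i in indices
                if rules[i]["action"] == correct_action
                and rules[i]["status"] == status]

    above = range(error_index - 1, -1, -1)      # rules before the error
    below = range(error_index + 1, len(rules))  # rules after the error

    for i in candidates(above, "accepted"):      # step 1: minor changes only
        if minor_ok(rules[i]):
            return 1, i
    for i in candidates(above, "hypothesized"):  # step 2: major changes
        if major_ok(rules[i]):
            return 2, i
    for i in candidates(below, "accepted"):      # step 3: the intervening rules
        if minor_ok(rules[i]):                   # must also be specialized
            return 3, i
    return 4, error_index                        # step 4: insert training rule
```

In step 4 the returned index is that of the error-causing rule, so inserting
the training rule at that position places it immediately in front of the
error-causing rule.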

This four-step process combines the task of integrating new knowledge
into the knowledge base with the task of generalizing the training rule. Notice
that the integration process must have knowledge about how the performance
element chooses which rule to execute, so that it can decide how to update the
rule base. The generalization process is fairly ad hoc. For example, recently
hypothesized rules become accepted when enough conditions are dropped from
the left-hand side so that only N conditions remain (N is a parameter given
to the program). This is a very weak technique for preventing rules from
becoming overgeneralized.


Figure D5b-1. The four steps to modifying the production-rule base.
[Diagram of the ordered rule list: above the error-causing rule, a search
for an "accepted" rule and a search for a "recently hypothesized" rule;
below it, a search for an "accepted" rule; and the insertion of the
training rule.]

Results

Waterman's poker program learned to play a fairly good game of poker.
Separate tests were conducted with each of the three training techniques. In
each case, the program started with only one rule: "In all situations, make a
random decision." For advice-taking from a human expert and for learning
from the expert program, training was continued until the program played
one complete game of five hands without once making an incorrect decision
(as judged by the expert). For the analytic method, the program continued to
play games until the original "random decision" production rule was executed
only 5% of the time. The results are shown in Table D5b-1.

The  rightmost  column  shows  the  results  of  a  proficiency  test  in  which  the 
program  and  a  human  expert  played  two  sets  of  25  hands.  During  the  first 
set  of  25  hands,  the  cards  were  drawn  at  random  from  a  shuffled  deck  as  in 
ordinary  play.  However,  during  the  second  set  of  25  hands,  the  same  hands 
were  used  as  in  the  first  set,  except  that  the  program  received  the  hands 
originally  dealt  to  the  person  and  vice  versa.  At  the  end,  the  cumulative 
winnings  of  the  program  and  person  were  compared. 
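
The comparison of cumulative winnings is reported as a percent difference,
defined in the footnote to Table D5b-1. A minimal Python rendering of that
definition follows; the function name is ours.

```python
def percent_difference(program_winnings, opponent_winnings):
    """Percent difference in winnings, per the footnote to Table D5b-1.

    The opponent's winnings are subtracted from the program's winnings and
    the result is divided by the opponent's winnings.
    """
    difference = program_winnings - opponent_winnings
    return 100.0 * difference / opponent_winnings
```

The value is negative whenever the program wins less than its opponent, and
it is undefined if the opponent wins nothing.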

The results show that in all three training methods, performance improved
markedly. The automatic training provided the best performance improvement,
perhaps because the automated expert played more consistently than
the human expert. Although the analytic method performed the poorest, the
results are not strictly comparable, since the axiom set provided it with only
four possible actions, whereas the advice-based methods were given eight
possible actions. Consequently, the analytic method may not actually be inferior
to the two advice-taking methods.

Table D5b-1

Comparison of Three Training Methods (from Waterman, 1970)

                      Number of         Final number   Percent difference
Training method       training trials   of rules       in winnings*

Before training              0                1              -71.0
Advice-taking               39               26               -6.8
Automatic training          20               10               -1.0
Analytic method             57               14              -13.0

*These percentages are computed by subtracting the amount of money won
by the opponent from the amount of money won by the program and dividing by
the amount of money won by the opponent. In all cases, the program won less
than the opponent and, hence, the percentages are all negative.

Conclusion

Waterman's poker-playing program faces a very difficult learning problem.
Poker is a multiple-step task that provides very little feedback to the learning
program. For the two advice-taking methods, this problem is sidestepped
by allowing the program to accept a training rule directly from an expert.
However, for the analytic method, two credit-assignment problems must be
solved: evaluating a round of play and evaluating a particular move. To solve
these problems, the program modifies its betting strategy (to call instead
of dropping) and applies knowledge available from the axiom set and from
the decision matrix. This permits the credit-assignment process to extract a
training rule from the trace of decisions taken by the performance element.
Once the training rule is acquired by any of these three methods, it is used
to guide the generalization and specialization of the production rules in the
knowledge base. Since only positive training instances are available, the
program must make use of arbitrary constraints to prevent overgeneralization.

References

Waterman  (1970)  describes  this  work  in  detail. 


D5c.  HACKER 


HACKER is a learning system developed by Gerald Sussman (1975) to model
the process of acquiring programming skills. HACKER's performance task is
to plan the actions of a hypothetical one-armed robot that manipulates stacks
of toy blocks. This planning task is described in detail in Article XV.C.

HACKER learns by doing. It develops plans and simulates their execution.
The plan and the trace of the execution are examined by HACKER to acquire
two kinds of knowledge: generalized subroutines and generalized bugs. A
generalized subroutine is similar to a STRIPS macro operator (see Article II.D5,
in Vol. I), in that it provides a sequence of actions for achieving a general goal.
A generalized bug is a demon that inspects new plans to see if they contain
an instance of the bug and provides an appropriate bug fix.

An example of a generalized subroutine is the following procedure for
stacking one block on top of another:

    (TO (MAKE (ON a b))
        (HPROG
          (UNTIL (y) (CANNOT (ASSIGN (y) (ON y a)))
                 (MAKE (NOT (ON y a))))
          (PUTON a b))).

The goal of this procedure is (MAKE (ON a b)): The procedure changes the
world so that (ON a b) is true. This subroutine is general and works for any
two blocks a and b (a and b are variables that are bound to particular blocks,
denoted by capital letters, when the subroutine is invoked). The procedure
removes everything that is on a and then picks up a and puts it on b.
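
The behavior of this procedure can be imitated in a few lines of Python. The
sketch below is not HACKER's code: the representation of the world as a
dictionary from each block to its support, and the decision to clear a block
by moving whatever is on it to the table, are assumptions made for
illustration.

```python
def make_on(on, a, b):
    """Achieve (ON a b) in the state `on`, a dict mapping block -> support.

    Returns the list of primitive PUTON steps that were performed.
    """
    steps = []
    # UNTIL no y can be found such that (ON y a): clear everything off a.
    while True:
        y = next((blk for blk, support in on.items() if support == a), None)
        if y is None:
            break
        # Assumed way to (MAKE (NOT (ON y a))): put y on the table.
        steps += make_on(on, y, "TABLE")
    on[a] = b                       # (PUTON a b)
    steps.append(("PUTON", a, b))
    return steps
```

Called with block C initially on block A, `make_on(on, "A", "B")` first moves
C to the table and then puts A on B.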

Viewed as a production rule, this procedure could be written as:

    (MAKE (ON a b))  =>  (HPROG
                           (UNTIL (y) (CANNOT (ASSIGN (y) (ON y a)))
                                  (MAKE (NOT (ON y a))))
                           (PUTON a b)).

From this perspective, we see that when HACKER learns a generalized
subroutine, it is learning both a generalized left-hand side, the goal, and a
generalized right-hand side, the plan. As we will see below, the left-hand sides
of the production rules are generalized by turning constants into variables,
while the right-hand sides are developed by concatenating subplans and ordering
them properly to form macro operators.

An example of the other kind of knowledge gained by HACKER, a generalized
bug, is the demon:

    (WATCH-FOR (ORDER (PURPOSE 1line (ACHIEVE (ON a b)))
                      (PURPOSE 2line (ACHIEVE (ON b c))))
               (PREREQUISITE-CLOBBERS-BROTHER-GOAL
                 current-prog 1line 2line
                 (CLEARTOP b))).

It tells HACKER to watch for plans in which one step, 1line, has the goal
of achieving (ON a b) and a subsequent step, 2line, has the goal of achieving
(ON b c). In such cases, the prerequisite of the second step (that b have
a clear top) requires undoing the goal of the first step. When this demon
detects such bugs, it invokes the PREREQUISITE-CLOBBERS-BROTHER-GOAL repair
procedure to fix them.
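
The pattern that this demon watches for can be sketched as a Python check
over the ordered purposes of a plan. The representation of purposes as
tuples is hypothetical, and the repair shown (moving the second step ahead
of the first) is an assumed stand-in for HACKER's repair procedure, which is
not spelled out here.

```python
def find_clobber(purposes):
    """Return (i, j) if step i achieves (ON a b) and a later step j
    achieves (ON b c); return None if the pattern does not occur."""
    for i, (rel1, _a, b) in enumerate(purposes):
        for j in range(i + 1, len(purposes)):
            rel2, b2, _c = purposes[j]
            if rel1 == rel2 == "ON" and b2 == b:
                return i, j
    return None


def criticize(purposes):
    """Apply the demon once. The repair (moving step j in front of step i)
    is an assumption made for this sketch."""
    hit = find_clobber(purposes)
    fixed = list(purposes)
    if hit is not None:
        i, j = hit
        fixed.insert(i, fixed.pop(j))
    return fixed
```

For the purposes (ON A B) followed by (ON B C), the sketch reorders the plan
so that (ON B C) is achieved first; a plan that does not contain the pattern
is returned unchanged.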

Generalized bugs can also be viewed as production rules. This particular
bug demon could be written as:

    (ORDER (PURPOSE 1line (ACHIEVE (ON a b)))
           (PURPOSE 2line (ACHIEVE (ON b c))))  =>
        (PREREQUISITE-CLOBBERS-BROTHER-GOAL
          current-prog 1line 2line
          (CLEARTOP b)).

HACKER learns both the left- and the right-hand sides of these bug demons.

HACKER's Architecture

HACKER is a complex program that contains several interleaved
components (see Fig. D5c-1). These include:

1. The planner, which develops plans by pattern-directed expansion of
planning operators;

2. The critics' gallery, which inspects the plans for known generalized bugs;

3. The simulator, which simulates the execution of the plans and checks for
errors;

4. The debugger and generalizer, which locate and repair bugs in the plans
for later use by the critics' gallery; and

5. The generalizer and subroutinizer, which generalize plans and install them
in HACKER's knowledge base.

The first two components comprise the performance element, which develops
block-stacking plans. The simulator creates a performance trace of the
simulated execution of the plan. The last two components perform the actual
process of learning generalized subroutines and generalized bugs.

These components interact continually. As the planner is developing the
plan, for example, the critics' gallery is interrupting to repair known bugs
and the simulator is symbolically executing the evolving plan. The debugger
may step in to fix a new bug and then resume the planning process. In this
article, however, we describe each of these components separately and pretend
that the plan is first developed in its entirety and then successively criticized,
simulated, debugged, and generalized. This false architecture corresponds
fairly closely to our simple model of learning multiple-step tasks. There are
two learning elements, however: one for developing generalized subroutines
and one for developing generalized bugs. Figure D5c-1 summarizes this false
architecture. We will explain the operation of HACKER by following the flow
through this model.

Figure D5c-1. A simplified architecture for HACKER.
[Block diagram; the surviving labels are "Performance Element" and
"Bug Learning Element".]

HACKER's Performance Element:
The Planner and the Critics' Gallery

HACKER employs a simple problem-reduction planner (Chap. XV; see also
Article II.B2, in Vol. I), which is presented with an initial situation and a goal
block-structure to create. Figure D5c-2 shows a sample situation and goal.

    Goal: (ACHIEVE (AND (ON A B) (ON C A)))

Figure D5c-2. A sample situation and goal.

The goal is matched against HACKER's knowledge base of known plans,
subroutines, and refinement rules. If a known plan or subroutine is found that
can accomplish the goal, it is used. Otherwise, a refinement rule is applied
to reformulate the goal as a set of subgoals. These subgoals, in turn, are
matched against the knowledge base to locate known methods for achieving
them. The expansion into subgoals proceeds until HACKER finds existing
plans or primitive operators that can achieve each of the subgoals.

HACKER is noted for its linearity assumption. Whenever the planner is
faced with the problem of achieving a pair of conjunctive subgoals, it assumes
that they can be achieved independently. This assumption is represented in
the AND rule for refining a conjunctive goal:

    (TO (ACHIEVE (AND a b))
        (AND (ACHIEVE a)
             (ACHIEVE b))).

This says "To achieve goals a and b, first achieve a and then achieve b." As
a result of this linearity assumption, the plan developed by the planner is a
naive plan that may not work (see Article XV.C).
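
The effect of the AND rule can be sketched in Python as a refinement
function that expands a conjunctive goal into its conjuncts in the order
given. The tuple representation of goals is an assumption of this sketch,
not HACKER's notation.

```python
def refine(goal):
    """Expand a goal into an ordered list of subgoals to achieve,
    under the linearity assumption."""
    if goal[0] == "AND":
        plan = []
        for subgoal in goal[1:]:
            # Achieve each conjunct in the order given, ignoring any
            # interaction between the subplans.
            plan += refine(subgoal)
        return plan
    return [("ACHIEVE", goal)]
```

Because the function never looks at how one subgoal's plan affects another,
the ordering it produces is exactly the naive plan described above.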

The naive plan is criticized by the critics in the critics' gallery, which
attempt to find instances of the generalized bugs kept in the bug library.
When a bug is found, the associated bug fix is applied to improve the plan,
usually by rearranging plan steps. The result of this criticism is a plan that
reflects all of HACKER's past experience but still may not be correct.

HACKER's Performance Trace:
Plans and Simulation

HACKER's plans contain a large amount of information about the planning
process itself. Each step of a plan is justified by giving the purpose of the
step: the subgoal it is intended to achieve. There are two fundamental kinds
of steps: main steps and prerequisite steps. Main steps are directed at goals
relating to the goals of the overall plan. Prerequisite steps are computations
needed to establish preconditions for the main steps. For example, the plan
for the problem of Figure D5c-2 contains three steps:

    Step 1. (PUTON C TABLE)  [purpose: (CLEARTOP A)  span: step 2].
    Step 2. (PUTON A B)      [purpose: (ON A B)      span: full plan].
    Step 3. (PUTON C A)      [purpose: (ON C A)      span: full plan].

Steps 2 and 3 are main steps, while step 1 is a prerequisite step needed to
clear off the top of A so that the robot can move A. As HACKER simulates the
execution of the plan, it verifies that the goal of each step has been attained.

Each step in the plan also includes an indication of the time span of the
goal it is attaining. The purpose of a step may be to accomplish something
that will remain true for only a short time. In this example, (CLEARTOP A) will
be true only until step 3. For HACKER to know that this is not a bug, step 1
includes a time-span indication that its goal is intended to be true only until
the end of step 2.

The criticized plan is simulated to verify that it works properly. The
simulator detects bugs in three forms: illegal operations, failed steps, and
unaesthetic actions. An illegal operation is one that is considered impossible
in the hypothetical blocks world. For instance, it is illegal to pick up a
block unless it has a clear top. A failed step is one that does not achieve its
goal for the designated time span. The simulator uses the goal information
attached to each plan step to verify that at all times the goals intended by the
planner have actually been met. Lastly, an unaesthetic action is a situation
in which the robot moves the same block two times in succession without
any intervening actions. These three methods for detecting bugs provide a
performance standard for HACKER, which states that a plan must execute
legally, achieve all intended goals and subgoals, and also be aesthetically
correct. The simulation halts whenever one of these problems is identified,
and a trace of the simulation is provided to the bug learning element.
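
The three checks can be sketched in Python. This is an illustration under
assumptions, not HACKER's simulator: the state is a dictionary from each
block to its support, each plan step is a tuple (block, destination,
purpose, last), where `last` is the number of the last step through which
the purpose must hold (None meaning the full plan), and the order in which
the checks are made is our own choice.

```python
def holds(on, fact):
    """Evaluate (ON x y) or (CLEARTOP x) in the state `on`."""
    if fact[0] == "ON":
        return on.get(fact[1]) == fact[2]
    if fact[0] == "CLEARTOP":
        return fact[1] not in on.values()
    raise ValueError(fact)


def simulate(on, plan):
    """Run `plan` on a copy of `on`; return the first bug found, or None."""
    on = dict(on)
    protected = []   # (fact, number of the last step that must satisfy it)
    previous = None
    for number, (block, dest, purpose, last) in enumerate(plan, start=1):
        # Illegal operation: picking up a block whose top is not clear.
        if not holds(on, ("CLEARTOP", block)):
            return ("ILLEGAL-OPERATION", number)
        on[block] = dest
        protected.append((purpose, len(plan) if last is None else last))
        # Failed step: a purpose is false inside its designated time span.
        for fact, until in protected:
            if number <= until and not holds(on, fact):
                return ("FAILED-STEP", number, fact)
        # Unaesthetic action: the same block moved twice in succession.
        if block == previous:
            return ("UNAESTHETIC-ACTION", number)
        previous = block
    return None
```

Applied to the three-step plan above, with C assumed to start on A, the
sketch reports no bug, because (CLEARTOP A) is checked only through step 2.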


HACKER's Learning Elements:
The Subroutine Learning Element and the Bug Learning Element

As mentioned above, there are two learning elements in HACKER. One,
the subroutine learning element, inspects the criticized plan and simulation
trace to identify possible subroutines. The other, the bug learning element,
examines the performance trace to diagnose and correct bugs uncovered by
the simulation.

The subroutine learning element attempts to detect when two subgoals
in the plan are sufficiently similar to allow a single subroutine to accomplish
both. The trace of the planning and simulation processes indicates which
constants in a goal or subgoal (for example, the constants A and B in the
goal (ON A B)) can be generalized. A constant cannot be generalized if the
plan somehow refers to that constant explicitly (e.g., the constant TABLE has
special status). HACKER generalizes each subgoal in the plan by turning
all generalizable constants into variables. The generalized subgoal is then
compared with all other goals in the program. Any two subgoals found to have
an allowable common generalization are replaced by calls to a parameterized
procedure. This generalization process is similar to the technique used in
STRIPS to generalize macro operators.

As an example, consider the block-stacking task of Figure D5c-2. The
initial plan involves separate steps for achieving (ON A B) and (ON C A). However,
traces of the planning and simulation processes indicate that the code for
(ON A B) will work for any variables u and v. The generalized goal (ON u v)
is checked against other goals in the plan and found to match the
subgoal (ON C A). As a result, HACKER formulates a generalized subroutine,
(MAKE-ON u v), and replaces the subplans for steps 2 and 3 with calls to
MAKE-ON. The MAKE-ON subroutine is placed in the knowledge base for use in
future plans as well.
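
The two operations in this example, turning constants into variables and
checking the generalized goal against another subgoal, can be sketched in
Python. The sketch simplifies heavily: HACKER decides which constants are
generalizable by analyzing the planning and simulation traces, whereas here
a fixed set of non-generalizable constants is simply assumed, and variables
are written as hypothetical names such as "?0".

```python
def variablize(goal, fixed=("TABLE",)):
    """Replace each generalizable constant in `goal` with a variable."""
    names = {}
    out = [goal[0]]
    for arg in goal[1:]:
        if arg in fixed:
            out.append(arg)          # constants with special status stay
        else:
            out.append(names.setdefault(arg, "?%d" % len(names)))
    return tuple(out)


def match(pattern, goal):
    """Return the variable bindings if `goal` is an instance of `pattern`,
    or None if it is not."""
    if len(pattern) != len(goal) or pattern[0] != goal[0]:
        return None
    bindings = {}
    for p, g in zip(pattern[1:], goal[1:]):
        if p.startswith("?"):
            if bindings.setdefault(p, g) != g:
                return None          # the same variable needs one value
        elif p != g:
            return None
    return bindings
```

Generalizing (ON A B) gives a pattern that matches (ON C A), which is the
condition under which both subplans can be replaced by calls to one
parameterized procedure.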

The subroutine learning element can be regarded as learning from
examples. The goals and subgoals in a particular plan form the training instances,
which are generalized by turning constants into variables. The distinctive
aspect of the HACKER approach is that the search of the rule space is
accomplished very directly. HACKER (and its predecessor, STRIPS) is able to reason
about how the different steps in the plan depend on particular values for the
arguments of the goal statement. From this dependency analysis, the correct
generalization can be deduced directly. HACKER thus differs from most of
the other learning methods described in this chapter in that it is able to use
the meanings of its operators to guide the generalization process.

The bug learning element faces a much more difficult learning task. It
must determine why the plan failed and repair the plan. Then it must attempt
to generalize the discovered bug and create a bug critic that will prevent
the bug from reappearing in future plans. The first task, determining why
the plan failed, is the problem of credit assignment. The traditional
credit-assignment problem is to determine which rule, used in the performance
element, led to the mistake. In HACKER's case, there is one fundamental
source of error: the linearity assumption as implemented by the AND rule.
HACKER's credit assignment, instead, involves determining how the current
planning task violates this linearity assumption; that is, how do the subplans
in this problem interact?

HACKER's solution to the credit-assignment problem is to compare the
intentions and expectations of the performance element with what actually
happened. This approach again relies on knowledge of the semantics of the
operators to assign blame to individual steps. This is more direct than the
weaker, more empirical approach of comparing many possible plans obtained
through a more widespread search, as in Samuel's checkers program and the
LEX system.

HACKER has a small library of schemas that describe possible subgoal
interactions. Credit assignment is accomplished by matching these schemas
to the goal structure of the current plan and performance trace. For example,
one class of interactions, the PREREQUISITE-CLOBBERS-BROTHER-GOAL, involves
the goal structure depicted in Figure D5c-3.

Figure D5c-3. The PREREQUISITE-CLOBBERS-BROTHER-GOAL bug schema.

The prerequisite step of goal 2 somehow makes goal 1 no longer true. For
example, if the overall goal is (ACHIEVE (AND (ON A B) (ON B C))), we have
the subgoal structure shown in Figure D5c-4.

            (AND (ON A B) (ON B C))

        (ON A B)            (ON B C)

                            (CLEARTOP B)

Figure D5c-4. A subgoal structure that matches the bug schema
of Figure D5c-3.

HACKER simulates this plan by first placing block A on block B, then
clearing off B so that it can place B on C. The clearing-off process makes
(ON A B) false: the prerequisite of goal 2 has clobbered goal 1. (This is
detected by the simulator when it checks the time span of each subgoal.)

Each of HACKER's bug schemas describes some general goal structure
that can be matched to the goal structure of the current plan. The matching
process is implemented in an ad hoc fashion as a series of six questions that the
debugger asks of the performance trace. As a result of the matching process,
the bug is ignored as innocuous, is properly classified, or is found to be too
difficult to repair.

The process of repairing the plan is straightforward. Each bug schema
contains instructions on how to repair the bug. These can involve reordering
plan steps, creating new subplans that establish prerequisite conditions,
and even removing unnecessary plan steps. The resulting repaired plan is
simulated again to detect further bugs.

The process of generalizing the bug is also easily accomplished. Each bug
schema contains instructions regarding which components of the goal
structure can be generalized by turning constants into variables. For instance,
the bug schema for PREREQUISITE-CLOBBERS-BROTHER-GOAL contains the instructions

    (CSETQ goal1 (VARIABLIZE (GOAL line1))
           goal2 (VARIABLIZE (GOAL line2))
           prereq (VARIABLIZE pro)),

where line1 refers to the first goal (the one that was clobbered), line2
refers to the second goal, and prereq refers to the prerequisite that did the
clobbering. These instructions tell HACKER to analyze the dependencies in
the performance trace and generalize all three of these goal expressions. The
resulting generalized goal structure shown in Figure D5c-5 is compiled into a
demon and added to the bug library for use in subsequent criticism of naive
plans.
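
What matters in this step is that the three expressions are generalized
together, so that a constant shared among them becomes the same variable in
each. A minimal Python sketch of that idea follows. It assumes every
constant is generalizable, which HACKER would instead establish from the
dependencies in the performance trace, and the variable names are chosen
only to resemble those of Figure D5c-5.

```python
def variablize_all(expressions, names="xyzw"):
    """Turn constants into variables consistently across several
    goal expressions, sharing one constant-to-variable mapping."""
    mapping = {}
    result = []
    for expr in expressions:
        new = [expr[0]]
        for const in expr[1:]:
            if const not in mapping:
                k = len(mapping)
                # Fall back to generated names if `names` runs out.
                mapping[const] = names[k] if k < len(names) else "v%d" % k
            new.append(mapping[const])
        result.append(tuple(new))
    return result
```

Applied to (ON A B), (ON B C), and (CLEARTOP B), the sketch yields
(ON x y), (ON y z), and (CLEARTOP y).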

The bug learning element can be regarded as learning by schema
instantiation. Over time, HACKER discovers new situations in which particular
kinds of subgoal interactions occur, generalizes these situations, and watches
for them in future plans. It does not tackle the problem of discovering these
classes of bugs in the first place, nor does it address the problem of discovering
techniques for fixing bugs.

            (AND (ON x y) (ON y z))

                            (CLEARTOP y)

Figure D5c-5. A generalized goal structure.

Conclusion

HACKER is a system that learns to develop plans for manipulating toy
blocks. It acquires two kinds of knowledge: generalized subroutines and
generalized bugs. Both of HACKER's learning elements make extensive use of
the performance trace, which consists of the plan (annotated with goal
information) and a trace of the simulated execution of the plan. The subroutine
learning element generalizes by analyzing the goal structure in the
performance trace to determine which constants can be turned into variables. The
bug learning element accomplishes credit assignment by instantiating schemas
that describe bug-inducing goal structures. The schemas provide guidance
for bug repair and generalization. Much of HACKER's impressive behavior
derives from its ability to reason about the semantics of its task. The value of
a transparent performance element for credit assignment and generalization
is very evident in HACKER.

References

HACKER is described in Sussman's (1973) thesis. Doyle (1980) describes
a formalization of the concepts of goal and intention as used by HACKER. An
alternative to the linearity assumption is described in Article XV.D1.


D5d.  LEX 


LEX, a system designed by Thomas Mitchell (see Mitchell, Utgoff, and Banerji,
in press; Mitchell, Utgoff, Nudel, and Banerji, 1981), learns to solve simple
symbolic integration problems from experience. LEX is provided with an
initial knowledge base of roughly 50 integration and simplification operators,
some of which are shown in Table D5d-1. The goal of LEX is to discover
heuristics for when to apply these operators. That is, LEX seeks to develop
production rules of the form

    (situation) => Apply operator OPi,

where (situation) is a pattern that is matched against the current integration
problem. The situations are expressed in a generalization language of possible
patterns. For instance, a heuristic rule for operator OP12 might be:

    $\int f(x)\,\mathrm{transc}(x)\,dx$ => Apply OP12 with $u = f(x)$ and $dv = \mathrm{transc}(x)\,dx$.

This tells the LEX performance element that if it sees any problem whose
integrand is the product of any function, f(x), with a transcendental function,
transc(x), then it should apply OP12 with u bound to f(x) and dv bound to
transc(x) dx. The concepts of f(x) and transc(x) are part of the generalization
language (illustrated later in Fig. D5d-4).
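
How such a heuristic might be matched against a problem can be sketched in
Python. Everything in the sketch is hypothetical: the small hierarchy below
is an invented fragment standing in for the generalization language, the
integrand is represented as a pair of (kind, text) factors, and the function
names are ours. It is meant only to show a pattern such as f(x) transc(x)
covering a specific integrand such as 3x cos x.

```python
# Hypothetical fragment of a generalization hierarchy: child -> parent.
PARENT = {"sin": "trig", "cos": "trig", "trig": "transc",
          "transc": "f", "monomial": "poly", "poly": "f"}


def is_a(kind, concept):
    """True if `kind` equals `concept` or lies below it in the hierarchy."""
    while kind is not None:
        if kind == concept:
            return True
        kind = PARENT.get(kind)
    return False


def applicable(heuristic, integrand):
    """Match a two-factor pattern against a two-factor integrand.

    On success, return the operator and the bindings for u and dv;
    otherwise return None.
    """
    (left, right), operator = heuristic
    (kind1, expr1), (kind2, expr2) = integrand
    if is_a(kind1, left) and is_a(kind2, right):
        return operator, {"u": expr1, "dv": expr2 + " dx"}
    return None
```

With the heuristic `(("f", "transc"), "OP12")`, the integrand 3x cos x
matches with u bound to 3x and dv bound to cos x dx, whereas the same
factors in the opposite order do not match, because a monomial is not a
transcendental function.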

Mitchell  calls  these  production  rules  heuristics  because  they  provide  heuris¬ 
tic  guidance  to  LEX’s  performance  element,  which  is  a  simple,  forward-chaining 
production  system  (see  Sec.  1I.B,  in  Vol.  l).  Without  any  heuristic  rules,  the 
performance  element  conducts  a  blind  uniform-cost  search  (see  Article  II.C1,  in 
Vol.  i)  of  the  space  of  all  legal  sequences  of  operator  applications.  Consider  the 
problem  of  integrating  /3xcosxdx.  Without  any  heuristics,  LEX  produces 
the  rather  large  search  tree  shown  in  Figure  D5d-1.  It  is  no  surprise  that 

Table D5d-1

Selected Integration Operators in LEX

OP02  convert $\int x^r\,dx$ to $x^{r+1}/(r+1)$  (power rule)
OP03  convert $\int r f(x)\,dx$ to $r \int f(x)\,dx$  (factor out a real constant)
OP06  convert $\int \sin x\,dx$ to $-\cos x$
OP08  convert $1 \cdot f(x)$ to $f(x)$
OP10  convert $\int \cos x\,dx$ to $\sin x$
OP12  convert $\int u\,dv$ to $uv - \int v\,du$  (integration by parts)
OP13  convert $0 \cdot f(x)$ to $0$

[Figure D5d-1: search tree rooted at $\int 3x \cos x\,dx$. On the solution path,
OP12 yields $3x \sin x - \int 3 \sin x\,dx$, OP03 yields $3x \sin x - 3\int \sin x\,dx$,
and OP06 yields $3x \sin x - 3(-\cos x)$. A second branch factors out the constant
first, giving $3\int x \cos x\,dx$, and reaches $3(x \sin x - (-\cos x))$.]

Figure D5d-1. Partial search tree for $\int 3x \cos x\,dx$ without heuristics.

when  LEX  has  no  heuristics,  it  often  cannot  solve  integration  problems  before 
exhausting  the  time  and  space  available  to  it. 

The task of learning the left-hand sides of heuristic rules can be thought
of as a set of concept-learning tasks. LEX tries to discover, for each operator
OPi, the definition of the concept of situations in which OPi should be used. It
accomplishes this by gathering positive and negative training instances of the
use of the operator. By analyzing a trace of the actions taken by the
performance element, LEX is able to find cases of appropriate and inappropriate
application of the operators. These training instances guide the search of
a rule space of possible left-hand-side patterns. The candidate-elimination
algorithm (see Article XIV.D3a) is employed to search the rule space, and
partially learned heuristics, for which the candidate-elimination algorithm has
not found a unique left-hand-side pattern, are stored as version spaces of
possible patterns. Thus, the general form of a heuristic rule in LEX is:

⟨version space represented as S and G sets⟩ ⇒ Apply OPi.

For example, after a few training instances, LEX might have the following
partially learned heuristic for the integration-by-parts operator, OP12:

Version space for OP12:

G = $\int f(x)\,g(x)\,dx \Rightarrow$ OP12, with $u = f(x)$ and $dv = g(x)\,dx$;

S = $\int 3x \cos x\,dx \Rightarrow$ OP12, with $u = 3x$ and $dv = \cos x\,dx$.


This heuristic tells LEX to apply OP12 in any situation in which the integral
has the form $\int f(x)\,g(x)\,dx$. It also indicates that the correct left-hand-side
pattern lies somewhere between the overly specific S pattern, $\int 3x \cos x\,dx$,
and the overly general G pattern, $\int f(x)\,g(x)\,dx$. Below, we show how this
partially learned heuristic was discovered by LEX.

LEX’s  Architecture 

LEX is organized as a system of four interacting programs (see Fig. D5d-2)
that correspond closely to our modified model of learning for multiple-step
tasks. The problem solver is the performance element. It solves symbolic
integration problems by applying the current set of operators and their
heuristics. When the problem solver succeeds in solving an integral, a detailed
trace of its performance is provided to the critic, which examines the trace to
assign credit and blame to the individual decisions made by the problem solver.
Once credit assignment is completed, the critic extracts positive (and negative)
instances of the proper (and improper) application of particular operators.
These training instances are used by the generalizer to guide the search for
proper heuristics for the operators involved. Finally, the problem generator
inspects the current contents of the knowledge base (i.e., the operators and
their heuristics) and chooses a new problem to present to the problem solver.

LEX thus incorporates all four components of our simple model: the
knowledge base (of operators and heuristics), the performance element, the
performance trace, and the learning element (composed of the critic and the
generalizer). Furthermore, LEX is one of the few AI learning systems to include
an experiment planner: the problem generator.

In  this  article,  we  first  present  an  example  of  how  LEX  solves  problems 
and  refines  the  version  spaces  of  its  heuristics.  Then  we  describe  each  of  LEX’s 
components  in  detail  and  discuss  some  open  research  problems. 


Figure D5d-2. LEX's architecture.

An Example

To show how LEX works, suppose that the problem generator has chosen
the problem $\int 3x \cos x\,dx$ and the problem solver has produced the trace shown
earlier in Figure D5d-1. The critic analyzes the trace and extracts several
training instances, including:

$\int 3x \cos x\,dx \Rightarrow$ OP12, with $u = 3x$ and $dv = \cos x\,dx$ (positive).

$\int 3 \sin x\,dx \Rightarrow$ OP03, with $r = 3$ and $f(x) = \sin x$ (positive).

$\int \sin x\,dx \Rightarrow$ OP06 (positive).


(positive) . 


We will watch how the generalizer handles the training instance for OP12.
Let us assume that this is the first training instance that has been found for
this operator, so the knowledge base does not yet contain any heuristics for
when to use it. Consequently, the generalizer will create and initialize a new
OP12 heuristic. The left-hand side of the heuristic is a version space of the
form:

Version space for OP12:

G = $\int f(x)\,g(x)\,dx \Rightarrow$ OP12, with $u = f(x)$ and $dv = g(x)\,dx$;

S = $\int 3x \cos x\,dx \Rightarrow$ OP12, with $u = 3x$ and $dv = \cos x\,dx$.

Notice that S is a copy of the training instance and G is the most general
pattern for which OP12 is legal. This heuristic will recommend that OP12
be applied in any problem whose integral is an instance of the pattern
$\int f(x)\,g(x)\,dx$. This is not a highly refined heuristic.

To see how LEX refines this heuristic, let us assume that the other training
instances shown above have been processed. At this point, the problem
generator chooses the problem $\int 5x \sin x\,dx$ to solve. The problem solver will
apply OP12, since the G set of the heuristic matches the integrand. Figure
D5d-3 shows a portion of the solution tree.

Some  of  the  training  instances  extracted  by  the  critic  are: 

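$\int 5x \sin x\,dx \Rightarrow$ OP12, with $u = 5x$ and $dv = \sin x\,dx$ (positive).

$\int 5 \cos x\,dx \Rightarrow$ OP03, with $r = 5$ and $f(x) = \cos x$ (positive).

$\int \cos x\,dx \Rightarrow$ OP10 (positive).

$\int 5x \sin x\,dx \Rightarrow$ OP12, with $u = \sin x$ and $dv = 5x\,dx$ (negative).

[Figure D5d-3: solution tree rooted at $\int 5x \sin x\,dx$, with two applications
of OP12. One branch yields $\tfrac{5}{2}x^2 \sin x - \int \tfrac{5}{2}x^2 \cos x\,dx$. The other
yields $-5x \cos x + \int 5 \cos x\,dx$, which OP03 converts to
$-5x \cos x + 5\int \cos x\,dx$ and OP10 converts to $-5x \cos x + 5 \sin x$.]

Figure D5d-3. The solution tree for $\int 5x \sin x\,dx$.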

The generalizer updates the version space for OP12 to contain:

G = {$g_1$, $g_2$}, where

$g_1$: $\int \mathrm{polynom}(x)\,g(x)\,dx \Rightarrow$ OP12,
with $u = \mathrm{polynom}(x)$ and $dv = g(x)\,dx$;

$g_2$: $\int f(x)\,\mathrm{transc}(x)\,dx \Rightarrow$ OP12,
with $u = f(x)$ and $dv = \mathrm{transc}(x)\,dx$;

S = {$s_1$}, where

$s_1$: $\int kx\,\mathrm{trig}(x)\,dx \Rightarrow$ OP12,
with $u = kx$ and $dv = \mathrm{trig}(x)\,dx$.

The positive training instance forces the constants 3 and 5 to be generalized
to k, which represents any integer constant, and "sin" and "cos" to be
generalized to "trig," which represents any trigonometric function, as shown in
$s_1$. Similarly, the negative training instance leads to two alternative
specializations. In $g_1$, f was specialized to "polynom" to avoid $u = \sin x$, and in
$g_2$, g was specialized to "transc" to avoid $dv = 5x\,dx$. These two specializations
no longer cover the negative training instance. With a few more training
instances, the heuristic for OP12 converges to the form shown at the start of
this article, that is, $\int f(x)\,\mathrm{transc}(x)\,dx$. The concepts "k," "trig," "polynom,"
and so on, are all part of the generalization language known to LEX from the
start (see Fig. D5d-4, shown later).
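The generalization step on the S pattern described above can be sketched as a least-common-ancestor lookup in a concept hierarchy. This is only an illustrative sketch: the `PARENT` table is a hypothetical fragment containing just the relationships mentioned in the text (the actual hierarchy is the one in Fig. D5d-4), placing `k` directly under `f` is an assumption, and the function names are invented for this example.

```python
# Hypothetical fragment of a generalization hierarchy (child -> parent).
# Assumption: only relationships mentioned in the text are modeled, and
# "k" is placed directly under "f" for simplicity.
PARENT = {
    "3": "k", "5": "k",            # integer constants generalize to k
    "sin": "trig", "cos": "trig",  # trigonometric functions
    "trig": "transc",
    "transc": "f", "polynom": "f", "k": "f",
}


def ancestors(node):
    """Return [node, parent, grandparent, ...] up to the root."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def generalize(a, b):
    """Most specific concept that covers both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError(f"no common ancestor for {a!r} and {b!r}")


def generalize_pattern(s_pattern, instance):
    """Generalize each slot of an S pattern just enough to cover a new
    positive instance."""
    return tuple(generalize(x, y) for x, y in zip(s_pattern, instance))


# S after the first instance (constant 3, function cos), then the second
# positive instance (constant 5, function sin).
new_s = generalize_pattern(("3", "cos"), ("5", "sin"))
print(new_s)  # ('k', 'trig')
```

Each slot is generalized independently and only as far as needed, which is why the constants stop at `k` and the functions stop at `trig` rather than rising to `f`.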

Now that we have seen an example of LEX in action, we describe each of
the four components of LEX in turn.

The Problem Solver

As  discussed  above,  the  problem  solver  conducts  a  forward  search  of 
possible  operator  applications  in  an  attempt  to  solve  the  given  integration 
problem.  Initially,  this  search  is  blind.  However,  as  the  heuristics  for  the 
operators  are  refined,  the  search  becomes  more  focused. 

The problem solver conducts a uniform-cost search. At each step, it
chooses the one expansion of the search tree that has the smallest estimated
cost. The search tree is maintained as a list of open nodes, that is, nodes
to which not all legal integration operators have been applied. The cost of
an open node is measured by summing the cost of each search step (for both
time and space) back to the root of the search tree. In addition, the cost of a
proposed expansion is weighted to reflect the strength of the heuristic advice
available. In detail, the problem solver chooses an expansion as follows:

Step 1. For each open node and each legal operator, compute the "degree
        of match" according to the formula:

        0    if no heuristic recommends this operator for this node;

        m/n  if there is a heuristic, and m out of the n patterns in the
             boundary sets of the version space (i.e., the S and G sets)
             match the current situation.

Step 2. Choose the expansion that has the lowest weighted cost, computed
        as:

        (1.5 - degree of match) × (cost so far + estimated expansion cost).

The effect of the (1.5 - degree of match) weight on the cost is to emphasize
the cost of the path when little heuristic guidance is available but to discount
cost considerations as the heuristic recommendation becomes stronger: the
weight falls from 1.5 when no boundary pattern matches to 0.5 when all of
them match.
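The two selection steps above can be sketched in a few lines. This is a minimal illustration, not LEX's implementation: the `Expansion` record, its field names, and the example cost figures are assumptions made for the sketch; only the degree-of-match rule and the (1.5 - degree of match) weighting come from the text.

```python
from dataclasses import dataclass


@dataclass
class Expansion:
    """A candidate (open node, operator) pair. Field names are illustrative."""
    node: str
    operator: str
    cost_so_far: float
    expansion_cost: float
    matched: int = 0  # m: boundary-set patterns matching the node
    total: int = 0    # n: patterns in the S and G sets; 0 means no heuristic


def degree_of_match(e):
    """Step 1: 0 if there is no heuristic, otherwise m/n."""
    return e.matched / e.total if e.total else 0.0


def weighted_cost(e):
    """Step 2: (1.5 - degree of match) * (cost so far + expansion cost)."""
    return (1.5 - degree_of_match(e)) * (e.cost_so_far + e.expansion_cost)


def choose_expansion(candidates):
    """Pick the candidate with the lowest weighted cost."""
    return min(candidates, key=weighted_cost)


# Made-up costs: an unadvised but cheap expansion versus a costlier one
# that every boundary pattern recommends.
a = Expansion("n1", "OP03", cost_so_far=2.0, expansion_cost=1.0)
b = Expansion("n2", "OP12", cost_so_far=3.0, expansion_cost=1.0,
              matched=3, total=3)
best = choose_expansion([a, b])
print(best.operator)  # OP12
```

In this example the unadvised expansion has weighted cost 1.5 × 3.0 = 4.5, while the strongly recommended one has 0.5 × 4.0 = 2.0, so heuristic advice outweighs the higher raw cost.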

The  problem  solver  continues  to  select  nodes  and  apply  operators  until 
the  integral  is  solved.  Notice  that,  in  LEX,  a  simple  performance  standard 
is  available:  solution  of  the  integral.  This  is  a  substantially  simpler  situation 
than  that  faced  by  Waterman's  poker  player,  which  needs  to  play  several 
hands  to  evaluate  how  well  it  is  doing.  LEX  knows  when  it  is  doing  well. 
LEX  also  knows  when  it  is  doing  poorly.  For  each  integration  problem,  the 
problem solver is given a time and space limit. If it runs out of time or space
before  solving  the  problem,  it  gives  up  and  the  problem  generator  selects  a 
new  problem  to  solve. 

The  Critic 

The problem solver provides the critic with a detailed trace of each
successfully solved problem. The critic's task is to extract positive and negative
training instances from this trace by assigning credit and blame to individual
decisions made by the problem solver. The critic solves the credit-assignment
problem as follows:

1.  Every search step along the minimum-cost solution path found by the
problem solver is a positive instance;

2.  Every step that (a) leads from a node on the minimum-cost path to a
node not on this path and (b) leads to a solution path whose length is
greater than or equal to 1.15 times the length of the minimum-cost path
is a negative instance.

These criteria are intended to produce applicability heuristics that guide
the performance element to minimum-cost solutions. To evaluate these criteria
(especially 2b), the critic must re-invoke the problem solver to follow out
paths that appear to be bad. This deeper search is in some ways analogous
to the deep search Samuel used in his checkers-playing program for solving
the credit-assignment problem. The criterion of minimum-cost solution is
convenient because it can be measured by the computer itself, through its own
experience in attempting to solve the problem.
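The two credit-assignment criteria can be sketched as a small labeling function. This is a sketch under stated assumptions: the trace encoding, the name `label_steps`, and the solution length of 6 used in the example are all hypothetical, and the lengths in `alternatives` stand in for what re-invoking the problem solver would report. Steps for which no solution was found are left unlabeled here, which is a conservative choice made for the sketch rather than something the text specifies.

```python
SAFETY_FACTOR = 1.15  # threshold quoted in the text


def label_steps(min_path, alternatives, safety_factor=SAFETY_FACTOR):
    """Split search steps into positive and negative training instances.

    min_path     -- list of (node, operator) steps on the minimum-cost path
    alternatives -- dict mapping an off-path (node, operator) step to the
                    length of the best solution path found through it, or
                    None if no solution was found (assumed encoding)
    """
    min_len = len(min_path)
    path_nodes = {node for node, _ in min_path}
    path_steps = set(min_path)

    positives = list(min_path)  # criterion 1
    negatives = []
    for step, length in alternatives.items():
        node, _ = step
        leaves_path = node in path_nodes and step not in path_steps  # 2a
        if leaves_path and length is not None \
                and length >= safety_factor * min_len:               # 2b
            negatives.append(step)
    return positives, negatives


# Trace for the 5x sin x example; the alternative's length (6) is made up.
min_path = [
    ("int 5x sin x dx", "OP12 u=5x"),
    ("-5x cos x + int 5 cos x dx", "OP03"),
    ("-5x cos x + 5 int cos x dx", "OP10"),
]
alternatives = {("int 5x sin x dx", "OP12 u=sin x"): 6}
positives, negatives = label_steps(min_path, alternatives)
print(negatives)  # [('int 5x sin x dx', 'OP12 u=sin x')]
```

Raising `safety_factor` makes the function label fewer steps as negative, which mirrors the more cautious behavior obtained by increasing the safety factor.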

The critic is fairly conservative. It provides the generalizer only with the
training instances that can be most reliably credited or blamed. However,
the critic is not infallible: it can produce false positive and false negative
training instances when the knowledge base contains incorrect heuristics.
Since the problem solver follows the guidance provided by the heuristics in
the knowledge base, it may believe it has found the lowest-cost solution when,
in fact, the heuristics have led it astray. Since LEX does not conduct an
exhaustive search of the space, it will not always detect this fact. The
critic's reliability can be improved by increasing the safety factor (normally
1.15) when the problem solver is re-invoked by the critic. This causes the
problem solver to search more deeply along alternative paths and improves the
chances of finding the true minimum-cost path.

The  Generalizer 

The  generalizer  simply  applies  the  candidate-elimination  algorithm  to 
process  each  of  the  training  instances  provided  by  the  critic  and  to  refine  the 
version  spaces  of  each  of  the  operators.  The  multiple-boundary-set  form  of 
the algorithm (see Article XIV.D3a) was adopted to handle erroneous training
instances. 

The generalizer is able to learn disjunctions in certain cases. During
generalization based on a positive training instance, for example, if the version
space would normally be forced to collapse because no consistent rule exists,
a second version space is created instead. This second version space contains
the patterns that are consistent with all of the negative instances and the
single new positive instance. As additional positive instances are received,
they are processed against any version space whose G set covers them. When
more than one heuristic rule is created for a single operator, the effect is the
same as if a single disjunctive heuristic had been developed.

The generalization language (and, thus, the rule space) in LEX is based
on the tree of functions shown in Figure D5d-4. The most general pattern
is $f(x)$, that is, any real function. The most specific functions are integer
and real constants, sine, cosine, tangent, and so on. This language is known
to have shortcomings (e.g., it cannot describe the class of twice continuously
differentiable functions), but it is adequate for expressing some of the
heuristics useful in the domain of symbolic integration.

LEX relies entirely on syntactic generalization methods. It cannot, for
example, analyze the solution of $\int 3x \cos x\,dx$ and realize that, since OP03
requires only a real constant r, the particular constant 3 can be generalized
to any real constant. This kind of analysis, based on the semantics of the
operators, is done in STRIPS and HACKER. The advantage of LEX's syntactic
approach is that it is general: it can be applied to any generalization language.


The  Problem  Generator 


The purpose of the problem generator is to select a set of integration
problems that form a good teaching sequence (see Article XIV.A). This portion
of LEX is still under development, so only some strategies that have been
proposed for the design of the problem generator are discussed here.

One strategy for selecting a new problem is to find an operator whose
version space is still unrefined and select a problem that "splits" the version
space, that is, an integral that matches only half of the patterns in the S
and G sets. If the problem solver can solve such a problem, LEX will be able
to refine the version space for that operator.
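The "splitting" strategy can be sketched as choosing, from a pool of candidate problems, the one whose fraction of matching boundary patterns is closest to one half. Everything concrete in this sketch is an assumption: the `PARENT` hierarchy fragment (including placing `kx` under `polynom`), the encoding of a problem or pattern as a pair of concept names for u and dv, and the function names. The boundary patterns are the g1, g2, and s1 patterns from the example earlier in this article.

```python
# Hypothetical hierarchy fragment (child -> parent); the real one is the
# tree in Fig. D5d-4.
PARENT = {"sin": "trig", "cos": "trig", "trig": "transc",
          "transc": "f", "polynom": "f", "kx": "polynom"}


def covers(general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    while specific != general and specific in PARENT:
        specific = PARENT[specific]
    return specific == general


def matches(pattern, problem):
    """A pattern matches when every slot covers the problem's slot."""
    return all(covers(p, q) for p, q in zip(pattern, problem))


def split_score(problem, boundary):
    """Distance of the matched fraction from one half (lower is better)."""
    fraction = sum(matches(p, problem) for p in boundary) / len(boundary)
    return abs(fraction - 0.5)


def choose_problem(candidates, boundary):
    """Pick the candidate that most nearly splits the version space."""
    return min(candidates, key=lambda prob: split_score(prob, boundary))


# Boundary sets for OP12 as (u concept, dv concept): g1, g2, s1.
boundary = [("polynom", "f"), ("f", "transc"), ("kx", "trig")]
candidates = [("kx", "cos"), ("kx", "polynom"), ("sin", "kx")]
print(choose_problem(candidates, boundary))  # ('kx', 'polynom')
```

The first candidate matches all three patterns and the last matches none, so neither would discriminate among them; the middle candidate matches only g1 and is therefore the most informative of the three.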


Figure D5d-4. Function hierarchy used in LEX's generalization language.

A second, related strategy is to take a problem that LEX has already
solved and modify it in some way. For instance, having solved the integral
$\int 3x \sin x\,dx$, LEX could consider attempting the integral $\int 5x \sin x\,dx$. This
would force it to generalize its version space to indicate that any constant
could appear (not just 5 or 3). The generalization hierarchy in Figure D5d-4
can be used to create such training problems.

A  third  strategy  is  to  look  for  overlaps  in  the  knowledge  base.  If  there 
are  two  operators  whose  version  spaces  overlap,  the  problem  generator  can 
choose  a  problem  for  which  both  operators  are  believed  to  be  applicable. 
The  resulting  attempt  to  solve  the  problem  may  show  that  only  one  of  the 
operators  should  be  used  in  such  situations. 

Finally,  when  LEX  is  just  beginning  to  learn,  it  may  be  necessary  to  apply 
the  inverses  of  the  integration  operators  to  create  problems  of  known  difficulty 
for  the  problem  solver  to  solve.  This  is  analogous  to  the  technique  of  providing 
students  in  chemistry  courses  with  an  “unknown”  that  is,  in  fact,  deliberately 
synthesized by the professor. LEX must learn how to control its search so that
it  can  solve  the  training  problem  without  being  overwhelmed  by  combinatorial 
explosion. 

The problem generator, more than any other component of the LEX
system, must have meta-knowledge of what LEX already knows and where its
weaknesses are. It must keep a history of previous problem-solving attempts,
so that it does not repeatedly propose unsolvable or uninformative problems.
The design of the problem generator is, in fact, the most difficult part of the
LEX project.

Conclusion 

LEX learns when to apply the standard operators of symbolic integration.
For each integration operator, the system learns a heuristic pattern.
The problem solver matches these patterns against the expression being
integrated to determine which operators should be applied. LEX obtains
training instances by observing its own attempts to solve integration problems.
Similarly, LEX obtains its performance standard by computing the cost of
the shortest solution path that it found when it tried to solve the problem.
The credit-assignment problem is solved by conducting a deeper search and
crediting those decisions that led to the minimum-cost solution. Decisions that
caused the problem solver to depart from the minimum-cost path are blamed.
Positive and negative training instances are thus extracted and processed by
the generalizer to update the version spaces of the integration operators.

Experiment planning is implemented in LEX by the problem generator,
which employs a variety of strategies to select problems that will help the
other components of the system refine the knowledge base.

The primary weakness of LEX, and a source of its generality, is that
it employs only syntactic methods of generalization. It is unable to reason
about the meanings of its operators, and thus it cannot use knowledge about
dependencies among operators to determine how the heuristics should be
generalized.

LEX does not attack the problems of learning new operators (i.e.,
right-hand sides of heuristic rules) or learning operator sequences (i.e., macros).
To learn a new integration operator, LEX would need much more knowledge
about mathematics and the goals of integration. This is a very difficult
learning problem. The problem of learning macro operators (i.e., useful
sequences of operators) and their applicability conditions has been addressed
in HACKER and STRIPS. Further work on LEX may include the learning of
such operators.

References

Mitchell, Utgoff, and Banerji (in press) and Mitchell, Utgoff, Nudel, and
Banerji (1981).

D5e.  Grammatical Inference

MOST AI RESEARCHERS employ numerical or logical representations in their
learning systems. In work on adaptive systems, for example, the concept to be
learned is often represented as a vector of numerical weights. Most of the other
systems described in this chapter represent their knowledge in logic-based
description languages (e.g., predicate calculus, semantic nets, feature vectors).
A number of researchers, however, have developed systems that employ formal
grammars to represent the learned concepts. This article discusses the body
of work, known as grammatical inference, that seeks to learn a grammar from
a set of training instances.

The primary interest in grammar learning can be traced to the use of
formal grammars for modeling the structure of natural language (see Chomsky,
1957, 1965). The question of how people learn to speak and understand
language led to studies of language acquisition; interest in modeling the
languages of other cultures encouraged the development of computer programs
to help field researchers construct grammars for unfamiliar languages (Klein
and Kuppin, 1970); and recent attempts by pattern recognition researchers to
use grammars to describe handwritten characters, visual scenes, and
cloud-chamber tracks have created a need for grammatical-inference techniques.
Thus, all of these researchers are interested in methods for learning a
grammar from a set of training instances.

A grammar is a system of rules describing a language and telling which
sentences are allowed in the language (see Article IV.C1, in Vol. I). Grammars
can describe natural languages (that is, languages spoken by people) and
formal languages (that is, simple languages amenable to mathematical analysis).
In natural languages, grammar rules indicate the generally accepted ways of
constructing sentences. In formal languages, however, grammars are applied
much more strictly. A formal grammar for a language, L, can be viewed as a
predicate that tells, for any sentence, whether it is grammatical, that is, "in"
the language L, or ungrammatical, that is, not a legal sentence in L. From
this formal perspective, a language is simply a potentially infinite set of all
legal sentences, and a grammar is simply a description of that set.

One might expect the task of learning a grammar to be the same as the
task of learning a single concept (see Sec. XIV.D3), since a single concept can
also be viewed as a predicate describing some set of objects. Usually, however,
this is not the case. Most formal languages are too complex to be described
by a single concept or rule. Instead, a grammar is usually written as a set
of rules that describe the phrase structure of the language. For example, we
might have one rule that says: A sentence is an article followed by a noun
phrase followed by a verb phrase. This could be written as the grammar rule:

⟨sentence⟩ → ⟨article⟩ ⟨noun phrase⟩ ⟨verb phrase⟩.

This  rule  describes  the  overall  structure  of  a  sentence.  Of  course,  there  are 
many  different  kinds  of  noun  and  verb  phrases.  These  can  also  be  described 
by  phrase-structure  rules.  We  might,  for  example,  write  another  rule 

⟨verb phrase⟩ → ⟨verb⟩

for the simplest case in which the verb phrase is just a single word, as in The
boy cried. A more complex verb phrase could be written as

⟨verb phrase⟩ → ⟨verb⟩ ⟨article⟩ ⟨noun phrase⟩

for sentences like The program learned the grammar.

A  grammar  can  thus  be  built  out  of  a  set  of  phrase-structure  rules  (also 
called  productions).  These  rules  break  the  problem  of  determining  whether 
a sentence is grammatical into the subproblems of determining whether it is
composed, for example, of a grammatical article followed by a grammatical
noun phrase followed by a grammatical verb phrase. In this way, the single
concept  grammatical  sentence  is  broken  into  the  subconcepts  of  noun  phrase 
and  verb  phrase.  Moreover,  such  subconcepts  are  not  independent  but  interact 
according  to  the  grammar  rules.  Thus,  determining  whether  a  sentence  is 
grammatical  is  a  multiple-step  task  involving  the  sequential  application  of 
phrase-structure  rules.  It  is  for  this  reason  that  we  include  grammatical 
inference  in  our  survey  of  systems  that  learn  to  perform  multiple-step  tasks. 

In  this  article,  we  first  introduce  formal  grammars  and  their  uses  and 
then  discuss  the  theoretical  limits  of  grammatical  inference.  The  problem 
of  learning  a  grammar  from  training  instances  has  received  a  fair  amount  of 
mathematical  analysis.  We  describe  the  principal  results  of  this  work  along 
with  their  relevance  for  practical  learning  systems.  Finally,  we  present  the 
four  major  methods  that  have  been  developed  for  learning  grammars. 

Grammars  and  Their  Uses 

In  the  theory  of  formal  languages,  a  language  is  defined  as  a  set  of  strings, 
where  each  string  is  a  finite  sequence  of  symbols  chosen  from  some  finite 
vocabulary.  In  natural  languages,  the  strings  are  sentences,  and  the  sentences 
are  sequences  of  words  chosen  from  some  vocabulary  of  possible  words.  To 
describe  languages,  Chomsky  (1957,  1965)  introduced  a  hierarchy  of  classes 
of  languages  based  on  the  complexity  of  their  underlying  grammars.  We  will 
focus  primarily  on  the  context-free  languages  (and  grammars). 

A  context-free  language  is  defined  by  the  following: 

1.  A  terminal  vocabulary  of  symbols — the  words  of  the  language; 

2.  A nonterminal vocabulary of symbols: the syntactic categories (e.g., "noun,"
"verb") of the language;

3.  A set of productions: the phrase-structure rules of the language; and

4.  The start symbol.

The best way to understand these definitions is by considering an example.
Examine the following context-free grammar, G, with

(a)  the terminal vocabulary {a, the, boy, girl, petted, held, puppy, kitten,
wall, hill, by, on, with};

(b)  the nonterminal vocabulary {Z, S, V, A, P, W, O, X};

(c)  the productions

Z → ASV,

V → X,  V → XAO,  V → VP,

P → WAS,  P → WAO,

A → a,  A → the,

S → boy,  S → girl,

W → by,  W → on,  W → with,

O → puppy,  O → kitten,  O → hill,  O → wall,

X → petted,  X → held;  and

(d)  the start symbol, Z.

This grammar, G, describes a language of simple sentences such as The boy
held the puppy and The girl on the hill held a kitten. It describes a sentence
by deriving it from the start symbol. We start with the symbol Z and
choose a production that has Z as the left-hand side. There is only one
such rule in G: Z → ASV. We apply this rule by rewriting Z as the string
ASV. Now we choose one of the nonterminals, A, S, or V, and find a rule
that can be used to rewrite it. If we choose the rule V → XAO, our current
sentence becomes ASXAO. We continue rewriting nonterminals (according to
the production rules) until the sentence contains only terminal symbols. A
complete derivation for the sentence The boy held the puppy is as follows:

Current sentence            Chosen production rule

Z
ASV                         (Z → ASV)
ASXAO                       (V → XAO)
the SXAO                    (A → the)
the boy XAO                 (S → boy)
the boy held AO             (X → held)
the boy held the O          (A → the)
the boy held the puppy      (O → puppy)

This is usually depicted as a derivation tree (see Fig. D5e-1).

[Figure D5e-1: derivation tree with root Z, whose children are A (the),
S (boy), and V; V in turn has the children X (held), A (the), and O (puppy).]

Figure D5e-1. Derivation tree for the sentence The boy held the puppy.
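The derivation above can be replayed mechanically. The following sketch encodes the productions of G as a dictionary and applies a chosen sequence of rules, each to the leftmost occurrence of its left-hand side; the dictionary layout and the function name `derive` are illustrative choices, not part of the original presentation.

```python
# Productions of the grammar G from the text: each nonterminal maps to its
# alternative right-hand sides.
G = {
    "Z": [["A", "S", "V"]],
    "V": [["X"], ["X", "A", "O"], ["V", "P"]],
    "P": [["W", "A", "S"], ["W", "A", "O"]],
    "A": [["a"], ["the"]],
    "S": [["boy"], ["girl"]],
    "W": [["by"], ["on"], ["with"]],
    "O": [["puppy"], ["kitten"], ["hill"], ["wall"]],
    "X": [["petted"], ["held"]],
}


def derive(start, choices):
    """Apply the chosen productions in order and return every sentential form.

    Each choice is (lhs, rhs); it rewrites the leftmost occurrence of lhs.
    """
    form = [start]
    steps = [form]
    for lhs, rhs in choices:
        if rhs not in G[lhs]:
            raise ValueError(f"{lhs} -> {' '.join(rhs)} is not a production of G")
        i = form.index(lhs)  # raises ValueError if lhs is absent
        form = form[:i] + rhs + form[i + 1:]
        steps.append(form)
    return steps


steps = derive("Z", [
    ("Z", ["A", "S", "V"]),
    ("V", ["X", "A", "O"]),
    ("A", ["the"]),
    ("S", ["boy"]),
    ("X", ["held"]),
    ("A", ["the"]),
    ("O", ["puppy"]),
])
for form in steps:
    print(" ".join(form))
```

The printed sentential forms follow the left-hand column of the derivation table, ending with "the boy held the puppy".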

Depending on which rules we choose during the rewriting process, we get
different sentences. If we choose "O → kitten" instead of "O → puppy," we
get the sentence The boy held the kitten. The context-free language described
by G is the set of all possible sentences that can be derived from Z by the
rewrite rules in G. Notice that we can also start our derivation with some
symbol other than Z. If we start with the nonterminal V, for example, we
generate the sublanguage of all verb phrases in G. Each nonterminal has a
sublanguage. Thus, each nonterminal represents a subconcept, such as noun
phrase (S) or verb phrase (V), of the overall concept of grammatical sentence
(Z).

In pattern recognition and language understanding, the performance task
facing a computer program is not the generation of grammatical sentences but
their recognition. Given a sentence, the problem of determining whether it
is grammatical (that is, of finding a derivation for the sentence) is called
parsing. Many efficient algorithms have been developed for parsing sentences
in context-free languages (see Article IV.D, in Vol. I; Hopcroft and Ullman,
1969).
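To make the recognition task concrete, here is a small recognizer for the sample grammar G. It is a plain memoized top-down search over word spans, written for clarity; it is not one of the efficient parsing algorithms referred to above, and the function names are illustrative. It relies on the fact that G has no empty productions, so every symbol covers at least one word, which also keeps the left-recursive rule V → VP from looping.

```python
from functools import lru_cache

# Productions of the sample grammar G, as given in the text.
G = {
    "Z": [["A", "S", "V"]],
    "V": [["X"], ["X", "A", "O"], ["V", "P"]],
    "P": [["W", "A", "S"], ["W", "A", "O"]],
    "A": [["a"], ["the"]],
    "S": [["boy"], ["girl"]],
    "W": [["by"], ["on"], ["with"]],
    "O": [["puppy"], ["kitten"], ["hill"], ["wall"]],
    "X": [["petted"], ["held"]],
}


def recognizes(grammar, start, sentence):
    """Return True if `start` derives the given sequence of words."""
    words = tuple(sentence)

    @lru_cache(maxsize=None)
    def derives(symbol, i, j):
        """True if `symbol` derives words[i:j] (with j > i)."""
        if symbol not in grammar:  # terminal symbol
            return j == i + 1 and words[i] == symbol
        return any(seq(tuple(rhs), i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def seq(rhs, i, j):
        """True if the symbol sequence `rhs` derives words[i:j]."""
        if len(rhs) == 1:
            return derives(rhs[0], i, j)
        # The first symbol takes words[i:k]; each remaining symbol needs
        # at least one word, which bounds k from above.
        return any(derives(rhs[0], i, k) and seq(rhs[1:], k, j)
                   for k in range(i + 1, j - len(rhs) + 2))

    return len(words) > 0 and derives(start, 0, len(words))


print(recognizes(G, "Z", "the boy held the puppy".split()))              # True
print(recognizes(G, "Z", "the girl held a kitten on the hill".split()))  # True
print(recognizes(G, "Z", "boy held the puppy".split()))                  # False
print(recognizes(G, "V", "held the puppy".split()))                      # True
```

The last call starts from V rather than Z, which tests membership in the sublanguage of verb phrases.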

Extensions to Context-free Grammars

Context-free grammars are able to capture much of the structure of
natural and artificial languages, especially computer programming languages.
However, many problems require extensions to the basic context-free grammar
framework.

Transformational grammars. Some characteristics of natural language
cannot be modeled with context-free grammars. One example that is
frequently cited is the "respectively" construction in sentences such as The
boy and the girl held the puppy and the kitten, respectively. Other examples
include the conversion of sentences from active to passive voice and
discontinuous constituents like throw out in the sentence He threw the junk out. In
response to these shortcomings of context-free grammars, Chomsky (1965)
developed the theory of transformational grammar (see Article IV.C2, in Vol. I),
in which a sentence is first derived as a so-called deep structure, then
manipulated by transformation rules, and finally converted into surface form by
phonological rules. The deep structure, which corresponds to the basic
declarative meaning of the sentence, is derived by a context-free grammar. The
transformation rules can modify the structure, but not the meaning, by
altering the derivation tree. For example, a transformation rule can convert a
declarative sentence into a question by flipping branches of the tree to change
the word order. Under such a transformation, the sentence The boy is
holding the dog becomes the question Is the boy holding the dog? Some methods
have been developed for learning transformation rules, as well as context-free
grammars, from examples. Particular attention has been given to learning
these rules under conditions believed to be similar to those under which a
child learns a language.

Stochastic grammars. Although context-free grammars (and
transformational grammars) can represent the phrase structure of a language,
they tell nothing about the relative frequency or likelihood of appearance of
a given sentence. It is common, for instance, in context-free grammars to use
recursive productions to represent repetition. In our sample grammar above,
the production V → VP is recursive. If we apply it over and over again, we can
generate sentences like The boy held the puppy on the wall by the hill with the
kitten... Although such a sentence is technically grammatical, it would be
useful to represent its degree of acceptability.

Stochastic grammars provide one approach to this problem. Each
production in a stochastic grammar is assigned a probability of selection,
that is, a number between zero and one. During the derivation process,
productions are selected for rewriting according to their assigned
probabilities. Consequently, each string in the language has a probability of
occurrence computed as the product of the probabilities of the rules in its
derivation. If we took our sample grammar, for instance, and assigned
probabilities of .5 to all of the rules except Z → ASV (probability 1.0) and
V → XAO (probability .33), the string “The boy held the puppy” has
probability 1(.33)(.5)(.5)(.5)(.5)(.5) = .01, while the string “The boy held
the puppy on the wall by the hill with the kitten” has probability
$1.58944 \times 10^{-7}$. This expresses the intuition that the second
sentence is very unlikely to be considered acceptable.
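The product rule can be checked with a short sketch for the first of the two strings. The rule probabilities are the ones stated in the text; the derivation list, the reading of the start rule as Z → ASV, and the names `rule_probability` and `derivation_probability` are illustrative assumptions.

```python
from math import prod


def rule_probability(lhs, rhs):
    """Probabilities as stated in the text: 1.0 for the start rule,
    .33 for V -> XAO, and .5 for every other production."""
    if (lhs, rhs) == ("Z", "A S V"):
        return 1.0
    if (lhs, rhs) == ("V", "X A O"):
        return 0.33
    return 0.5


# Leftmost derivation of "the boy held the puppy" (assumed reconstruction).
DERIVATION = [
    ("Z", "A S V"),
    ("A", "the"),
    ("S", "boy"),
    ("V", "X A O"),
    ("X", "held"),
    ("A", "the"),
    ("O", "puppy"),
]


def derivation_probability(derivation):
    """Multiply the probabilities of all rules used in the derivation."""
    return prod(rule_probability(lhs, rhs) for lhs, rhs in derivation)
```

Here `derivation_probability(DERIVATION)` multiplies one factor of 1.0, one of .33, and five of .5, which rounds to the .01 quoted above.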

Stochastic grammars have been employed by pattern recognition
researchers in noisy and uncertain environments where it is better to have an
indication of the degree of grammaticality of a sentence than a single yes-no
decision. Stochastic grammars also allow grammatical-inference programs to
represent uncertainty about the true language when noisy and unreliable
training instances are presented.

Graph grammars. In syntactic pattern-recognition problems, it is often
important to represent the two- or three-dimensional structure of “sentences”
in the language. Traditional context-free grammars, however, generate only
one-dimensional strings. Context-free graph grammars have been developed
that construct a graph of terminal nodes instead of a string of terminal symbols
(see Article XIII.R3). Rewrite rules in the grammar describe how a nonterminal
node can be replaced by a subgraph. Evans (1971) employs a set of graph
grammars to describe visual scenes. Other researchers have applied graph
grammars to the pattern recognition of handwritten characters and
cloud-chamber tracks. This latter use of grammars is especially appropriate in
that  the  rewrite  rules  in  the  grammar  directly  correspond  to  properties  of 
the  pattern.  For  example,  subatomic  particles  decay  into  other  particles 
only  in  certain  ways,  and  these  decay  events  can  be  modeled  naturally  with 
productions  whose  left-hand  sides  have  the  decaying  particles  and  whose  right- 
hand  sides  state  the  corresponding  particles  into  which  they  decay. 

Theoretical  Limitations  of  Grammatical  Inference 

Now that we have reviewed some of the important kinds of formal
languages and grammars, we turn our attention to the problem of learning these
formal languages from examples. As with other forms of learning from
examples, it is profitable to view grammatical inference as a search through a
rule space of all possible context-free grammars for a grammar that is
consistent with the training instances chosen from an instance space. In
language learning, the training instances are usually sample sentences that
have been classified by a teacher to indicate whether or not they are
grammatical. The goal of the grammatical-inference program is to find a
grammar for the “true” language that underlies the training instances.

Under what conditions is it possible to learn the correct context-free
language from a set of training instances? This question has received a fair
amount of study, and several results have been obtained. The most important
result is that it is impossible to learn the correct language (or the correct single
concept) from positive examples alone. Gold (1967) proved that if a program
is given an infinite sequence of positive examples (that is, sentences known
to be “in” the language), the program cannot determine a grammar for the
correct context-free language in any finite time. To see why this is so, consider
that at some point the program has received k strings
$\{s_1, s_2, \ldots, s_k\}$. There are many possible languages that are
consistent with these examples. The most general, universal language, which
contains all possible strings of the terminal symbols, certainly contains all of
the strings in the sample. Similarly, the trivial language
$L = \{s_1, s_2, \ldots, s_k\}$ is the most specific language that
contains all of the strings in the sample. There are many possible languages
between these two extremes. No finite sample will allow the learning program
to choose the correct language from these various possibilities.

Fortunately, in most learning situations, additional information is
available that can help constrain the choices of the learning program so that a
reasonable language, and its grammar, can be found. Let us examine possible
sources of this additional information.

Negative examples. Negative training instances allow the program to
eliminate grammars that are too general (see Article XIV.D3a, on the
candidate-elimination algorithm). Gold (1967) showed that if the learning
program could pose questions to an informant, that is, ask a person whether or
not a given string was grammatical, the true language could be learned. The
informant could be used to obtain complete positive and negative examples and
thus determine exactly the true language. Gold called this learning situation
informant presentation.

Stochastic presentation. When a program is trying to learn a
stochastic context-free grammar, learning is also possible if the training
instances are presented to the program repeatedly, with a frequency
proportional to their probability of being in the language. In this
stochastic-presentation method, the program can estimate the probability of a
given string by measuring its frequency of occurrence in the finite sample. In
the limit, stochastic presentation gives as much information as informant
presentation of positive and negative examples: Ungrammatical strings have
zero probability, and grammatical strings have positive probability.
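The frequency estimate described above can be sketched in a few lines. This is only an illustration under the assumption that the sample is a list of strings drawn with repetition; the name `estimate_probabilities` is hypothetical.

```python
from collections import Counter


def estimate_probabilities(sample):
    """Estimate each string's probability by its relative frequency
    in a finite, stochastically presented sample."""
    counts = Counter(sample)
    total = len(sample)
    return {string: count / total for string, count in counts.items()}
```

A string that never appears gets no entry, which corresponds to an estimated probability of zero; with a finite sample this is only evidence, not proof, that the string is ungrammatical.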

Prior distributions. As we have seen above, even after a set of positive
instances has been processed, there are still many possible languages, and
hence many possible grammars, for the learning program to choose from.
Furthermore, even when a unique language has been determined, as with
informant presentation, there may be several different grammars that all
generate the same language. One way to tell a program how to choose the right
grammar is to define a prior probability (or desirability) distribution over all
possible grammars. The program can then choose the most probable grammar
that is consistent with the training instances. Horning (1969) employs a
prior distribution that makes simple grammars more likely than complex
ones, where simple grammars are those that have fewer nonterminals, fewer
productions, shorter right-hand sides, and so on.

Semantics. According to cognitive psychologists, children receive little
negative feedback when they are learning a language. Consequently, we
are faced with the puzzle of how people are able to learn natural language
almost entirely from positive training instances. One important source of
information for children may be the meaning of the sentences they hear. A few
psychological theories, and some computer programs (see below), have been
developed that incorporate semantic constraints as a source of information.
These theories basically claim that the grammatical structure of a language
parallels the semantic structure of the internal representation that people
employ.

Structural presentation. One technique employed by pattern
recognition researchers to aid grammatical inference is structural
presentation, in which the program is given some information about the
derivation tree of the sample sentences. This is similar to the use of book
training in Samuel’s checkers program. The derivation tree provides a
move-by-move (or, in this case, a rule-by-rule) performance standard along
with each training instance.

Grammar  restriction.  One  final  way  to  get  around  Gold’s  results  is 
to  learn  only  special  subclasses  of  the  context-free  languages.  In  particular, 
grammatical  inference  is  much  easier  for  regular  and  delimited  languages, 
which,  though  not  as  powerful  as  the  context-free  languages,  have  important 
practical  applications. 

In summary, then, although Gold’s theorems show that the formal
problem of learning a context-free grammar from positive instances alone is
impossible, there are many alternative sources of information that allow
programs, and presumably people, to learn language.


Methods  of  Grammatical  Inference 

In  this  section,  we  survey  four  basic  techniques  that  have  been  used  to 
learn  context-free  grammars  from  training  instances.  The  various  methods, 
some of which parallel the basic learning methods discussed in Article XIV.D1,
differ  primarily  in  the  way  that  they  search  the  rule  space  and  the  kinds  of 
information  that  they  use  to  guide  that  search. 

The first approach we discuss is enumeration. Enumerative, or
generate-and-test, methods propose possible grammars and then test them
against the data. The second basic grammatical-inference technique is
construction. Constructive methods usually learn from positive examples only.
They collect information about the structure of the sample strings and use it
to build a grammar reflecting that structure. Refinement methods form a third
important class of grammatical-inference techniques. They start with a
hypothesis grammar and gradually improve it by means of various heuristics
based on additional training instances. Finally, semantics-based methods
employ knowledge of the meanings of the sample sentences to decide how to
search the rule space. Most semantics-based methods have been developed to
model how children learn natural languages.

Rules of generalization and specialization for grammars. Before
describing these learning methods in more detail, we first discuss three
methods for the syntactic generalization and specialization of grammars:

1. Merging. A context-free grammar can be generalized by an operation
called merging. Suppose the grammar G contains two nonterminals, A
and B. We can modify G to obtain a more general grammar by merging
A and B, that is, by creating a new nonterminal, Q, and replacing
all occurrences of A and B by Q. This has the effect of pooling the
sublanguages of A and B to create a new sublanguage, Q, whose strings
may appear anywhere that either the strings of A or the strings of B
could have appeared. Suppose, for example, that in our sample grammar
discussed above, we merged S (subjects) and O (objects) to obtain Q. The
productions of the grammar G become:

     Z → AQV,
     V → X, V → XAQ, V → VP,
     P → WAQ,
     A → a, A → the,
     W → by, W → on, W → with,
     Q → puppy, Q → kitten, Q → hill, Q → wall,
     Q → boy, Q → girl,
     X → petted, X → held.

Previously ungrammatical sentences like The puppy petted the boy are now
allowed. The language is thus larger and, consequently, more general.

2. Splitting. The inverse of merging is a specialization process called
splitting. We can specialize a grammar by splitting the sublanguage of one
nonterminal, N, into two smaller sublanguages, $N_1$ and $N_2$. This is
accomplished by replacing some occurrences of N in the grammar by $N_1$
and others by $N_2$. In the grammar above, for instance, we could split
the A (article) nonterminal into $A_1$ and $A_2$ to obtain the grammar:

     Z → A_1 QV,
     V → X, V → X A_2 Q, V → VP,
     P → W A_2 Q,
     A_1 → a, A_2 → the,
     W → by, W → on, W → with,
     Q → puppy, Q → kitten, Q → hill, Q → wall,
     Q → boy, Q → girl,
     X → petted, X → held.

Now all sentences must begin with “a,” and all prepositional phrases and
object phrases must use “the.” The previously grammatical sentence
The boy petted the puppy is now illegal. This language is therefore more
specialized.

3. Disjunction. One operation that is similar to merging is called
disjunction. In disjunction, we choose two strings, $s_1$ and $s_2$, and
create a new nonterminal, D, whereby the rules D → $s_1$ and D → $s_2$ are
added to the grammar. Every occurrence of the strings $s_1$ and $s_2$ in
existing productions is replaced by D. For example, we could disjoin AO and
AS in our sample grammar to create the new nonterminal, N (noun phrase). The
grammar then becomes:
     Z → NV,
     V → X, V → XN, V → VP,
     P → WN,
     N → AS, N → AO,
     A → a, A → the,
     S → boy, S → girl,
     W → by, W → on, W → with,
     O → puppy, O → kitten, O → hill, O → wall,
     X → petted, X → held.

This operation is similar to merging, except that it can be applied to
strings of terminals and nonterminals. If both $s_1$ and $s_2$ are simple
nonterminal symbols, disjunction has the same effect as merging. If only
one of $s_1$ or $s_2$ is a nonterminal, the operation is called substitution.
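Merging and disjunction can be sketched as operations on a list of productions. The representation (pairs of a left-hand side and a tuple of right-hand-side symbols), the reconstructed sample grammar, and the function names `merge` and `disjoin` are all assumptions made for illustration; splitting is omitted because it requires choosing which occurrences to rename.

```python
# Sample grammar as (left-hand side, right-hand side) pairs, reconstructed
# from the productions quoted in this article (an assumption).
SAMPLE = [
    ("Z", ("A", "S", "V")),
    ("V", ("X",)), ("V", ("X", "A", "O")), ("V", ("V", "P")),
    ("P", ("W", "A", "O")),
    ("A", ("a",)), ("A", ("the",)),
    ("S", ("boy",)), ("S", ("girl",)),
    ("O", ("puppy",)), ("O", ("kitten",)), ("O", ("hill",)), ("O", ("wall",)),
    ("W", ("by",)), ("W", ("on",)), ("W", ("with",)),
    ("X", ("petted",)), ("X", ("held",)),
]


def merge(productions, a, b, new):
    """Generalize by replacing every occurrence of nonterminals a and b by new."""
    def rename(sym):
        return new if sym in (a, b) else sym

    merged = []
    for lhs, rhs in productions:
        rule = (rename(lhs), tuple(rename(sym) for sym in rhs))
        if rule not in merged:          # drop duplicates created by the merge
            merged.append(rule)
    return merged


def disjoin(productions, s1, s2, new):
    """Replace the symbol sequences s1 and s2 on existing right-hand sides
    by new, then add the rules new -> s1 and new -> s2."""
    def substitute(rhs):
        out, i = [], 0
        while i < len(rhs):
            for s in (s1, s2):
                if rhs[i:i + len(s)] == s:   # sequence found at position i
                    out.append(new)
                    i += len(s)
                    break
            else:                            # neither sequence starts here
                out.append(rhs[i])
                i += 1
        return tuple(out)

    rewritten = [(lhs, substitute(rhs)) for lhs, rhs in productions]
    return rewritten + [(new, s1), (new, s2)]
```

For example, `merge(SAMPLE, "S", "O", "Q")` produces the first grammar listed above, and `disjoin(SAMPLE, ("A", "O"), ("A", "S"), "N")` produces the noun-phrase grammar.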

These rules of generalization can be applied to move from one point in
the  rule  space  (i.e.,  one  grammar)  to  another.  We  now  turn  our  attention  to 
the  four  basic  methods  of  grammatical  inference  and  show  how  they  apply 
these  operations  to  search  the  space  of  possible  context-free  grammars. 


Enumerative Methods

Enumerative methods generate grammars one by one and test each to
determine how well it accounts for the training instances. The first
enumerative method we consider is that of Horning (1969), who developed a
procedure for finding the most plausible stochastic grammar consistent with a
set of stochastically presented training instances. The general idea behind
Horning’s method is to enumerate all possible grammars in order of simplicity
and choose the first grammar that is consistent with the training data. The
actual algorithm is somewhat more complicated, however, since Horning seeks
the most likely stochastic grammar, that is, the grammar G that is most likely
to have generated the observed set S of sample strings. This is expressed
formally as the grammar G that maximizes $P(G \mid S)$, that is, the
probability of G given S. Unfortunately, it is difficult to compute
$P(G \mid S)$ directly from the training instances. Bayes' theorem, however,
provides a way of computing $P(G \mid S)$ from three other quantities,
$P(G)$, $P(S \mid G)$, and $P(S)$:

$$P(G \mid S) = \frac{P(G) \times P(S \mid G)}{P(S)},$$

where $P(G)$ is the a priori probability that G is the “true” grammar, $P(S)$
is the a priori probability of observing the particular sample S, and
$P(S \mid G)$ is the probability of observing S given the grammar G. Since
$P(S)$ is independent of G, we can maximize $P(G \mid S)$ by just maximizing
the numerator $P'(G \mid S) = P(G) \times P(S \mid G)$. The probabilities
$P(G)$ and $P(S \mid G)$ can be computed for any particular grammar G.
The probability $P(S \mid G)$ that the training instances S will be generated
by the stochastic grammar G can be computed directly from G by parsing
each sentence in S. The problem of computing $P(G)$ is more difficult, however.
Horning sought to have the a priori probability of G reflect the complexity
of the grammar G. Simple grammars should be highly probable; complex
grammars should be improbable. Consequently, he developed the idea of a
grammar-grammar, that is, a stochastic grammar that generates a stochastic
grammar as its terminal string. Such a grammar-grammar can be constructed
from a terminal vocabulary of symbols such as A, B, C, Z, etc. Since, as
we have seen above, a stochastic grammar generates short strings with a much
higher probability than it does long strings, the grammar-grammar generates
simple grammars with a much higher probability than it does complex ones. In
particular, the probability $P(G)$ is the probability that the grammar-grammar
would generate G.

Since we can compute $P(G)$ and $P(S \mid G)$, we can use Bayes’ theorem
to compute $P'(G \mid S)$. Therefore, if we compute $P'(G \mid S)$ for all
possible grammars, G, we can find the grammar that most likely generated S.
Such a procedure is impossibly inefficient, however. Instead, Horning used the
following technique. First, he developed a procedure that could enumerate
all possible stochastic grammars starting with the most likely grammar, $G_1$,
and continuing on in order of decreasing probability $P(G_i)$. Next, he
noticed that $P'(G_i \mid S)$ did not have to be computed for all grammars but
only for those grammars whose probability $P(G_i)$ was greater than
$P'(G_1 \mid S)$. This is because once $P(G_i)$ falls below $P'(G_1 \mid S)$,
the product $P(G_i) \times P(S \mid G_i)$ can never exceed $P'(G_1 \mid S)$,
since $P(S \mid G_i)$ is always less than or equal to 1.

Consequently, Horning’s method enumerates all grammars $G_i$ starting
with $G_1$ and continuing until $P(G_i) < P'(G_1 \mid S)$. The probability
$P'(G_i \mid S)$ is computed for each grammar $G_i$, and the grammar that
maximizes $P'(G_i \mid S)$ is output as the grammar most likely to have
produced the set of examples, S.
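The bounded enumeration can be sketched as follows. This is a minimal illustration, not Horning's program: it assumes the caller supplies the candidate grammars with their prior probabilities, already sorted by decreasing prior, together with a function returning the likelihood of the sample under a grammar. The names `most_likely_grammar`, `candidates`, and `likelihood` are hypothetical.

```python
def most_likely_grammar(candidates, likelihood):
    """Pick the grammar maximizing prior * likelihood.

    candidates: iterable of (grammar, prior) pairs in order of decreasing prior.
    likelihood: function returning P(S | grammar), a number between 0 and 1.
    """
    best, best_score, bound = None, -1.0, None
    for grammar, prior in candidates:
        # Once the prior drops below the score of the first grammar, no
        # later grammar can beat it, because the likelihood is at most 1.
        if bound is not None and prior < bound:
            break
        score = prior * likelihood(grammar)
        if bound is None:
            bound = score           # score of the first grammar, G_1
        if score > best_score:
            best, best_score = grammar, score
    return best, best_score
```

Note that if the first grammar cannot generate the sample at all, its score is zero and the bound never stops the enumeration; this sketch therefore terminates on such input only when the candidate list is finite.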

The algorithm is theoretically correct (it always finds the best grammar),
but it is still too inefficient for all but the smallest grammars. Therefore,
Horning modified the grammar generator to generate only grammars that
were deductively acceptable (DA). A grammar G is deductively acceptable if it
generates every string in the sample, S, and if every production in G is used
to derive at least one of the training instances. In other words, a DA grammar
must be consistent with the training instances and must not be overly specific
or cluttered by useless productions. It can be shown that all DA grammars
with k + 1 nonterminals can be obtained by splitting DA grammars with k
nonterminals. Furthermore, once a grammar ceases to be deductively
acceptable, no further splits will make it deductively acceptable, since it is
already overly specific.

These facts were used by Horning to organize the rule-space search.
Starting with the most general (and most likely) DA grammars, repeated splits
are made until either the grammars cease to be deductively acceptable or their
a priori probability $P(G_i)$ falls below the bound $P'(G_1 \mid S)$. The
probability $P'(G_i \mid S)$ is computed for all of the generated grammars,
and the grammar that maximizes $P'(G_i \mid S)$ is selected. This procedure,
although more efficient than the first one, is still of theoretical interest
only.

A second enumerative method makes use of training instances to guide
the enumeration of plausible grammars. Pao (1969) describes an approach to
grammatical inference that resembles the plan-generate-test paradigm of the
DENDRAL program (see Sec. VII.C2, in Vol. II). In the initial planning phase,
Pao’s algorithm analyzes the (positive) training instances and constructs a
trivial grammar, that is, a very specific grammar that generates only the
training examples. A partially ordered set (actually, a lattice) of plausible
grammars can be generated by merging nonterminals from this trivial
grammar. During the generate-and-test phase, Pao’s algorithm enumerates all
of these grammars in order, from most specific to most general, and tests them
by consulting an informant.

Pao's algorithm generates two grammars at a time, G and H, and uses
an informant to eliminate one of the two. The informant is presented with
a new sentence, s, that is generated by G but not by H. If the informant
says s is in the “true” language, then H and all grammars more specific than
H are removed from further consideration. Also, the set of grammars more
general than H (but not more general than G) is searched in order from
general to specific, and grammars that do not generate s are discarded. If,
on the other hand, the informant says that s is not in the “true” language,
then G and all grammars more general than G are removed from further
consideration. The generating and testing of possible grammars continues
until only one possible grammar remains. This search through the partially
ordered set of all possible grammars is similar to Mitchell’s (1978)
candidate-elimination algorithm (see Article XIV.D3a). In Pao’s program,
though, an active experimentation approach is employed to search the space
rather than waiting for new training instances to drive the search.

Unfortunately, this method does not work for general context-free
grammars. The basic algorithm works only for regular grammars, that is,
grammars whose productions all have the form N → tM or N → t for t, a single
terminal symbol, and M, a single nonterminal symbol. In regular languages,
there is no difficulty finding a test sentence s to distinguish between two
grammars G and H. This cannot be done in general for context-free
languages. Pao has extended the method to handle delimited grammars,
a somewhat larger class of grammars than the regular grammars.

Constructive  Methods 

Constructive methods attempt to build a plausible grammar using only
the information from a positive sample with no informant. From Gold's
theorems, it is clear that this problem is ill-formed, since no unique language
is determined by a set of positive instances. However, various heuristics have
been developed for constructing simple, fairly general grammars from positive
instances only.

One important set of heuristics is based on the idea of the distribution
of substrings in the language. In context-free languages, certain classes of
strings, such as noun phrases and prepositional phrases, tend to appear in
the same contexts in different sentences. This suggests that we might be able
to discover interesting classes of strings by looking at their surroundings in
the set of sample sentences. For instance, the words a and the both tend
to occur at the beginnings of sentences, so perhaps they should be grouped
together to form the class of articles. This is done by creating a nonterminal
A and inventing the production rules “A → a” and “A → the.” Distributional
analysis has been employed by Harris (1964), Fu (1975), Kelley (1967), and
Klein and Kuppin (1970).

For regular grammars, Fu (1975) has applied a particular kind of
distributional analysis based on the idea of the formal derivative of a
string. The formal derivative of a string s is the set of strings

$$D_s L = \{\, t \mid \text{the string } st \text{ is in the language } L \,\},$$

that is, all of the strings t that follow s in the given language L in sentences
where s is at the beginning of the sentence.

Formal derivatives can be employed to construct regular grammars in a
straightforward way. Imagine that we have a grammar G, and we are in the
process of generating a sentence. Suppose that, so far, we have generated the
string sU, where U is a nonterminal and s is a terminal string. If we take
formal derivatives for every string sa that appears in the sample (where a is
a single terminal symbol), we can create new nonterminals for each distinct
formal derivative. We can add the productions

$$U \to aV_1, \qquad U \to bV_2, \qquad \ldots, \qquad U \to mV_k$$

to the grammar, G, where $V_1, V_2, \ldots, V_k$ correspond to the formal
derivatives of sa, sb, ..., sm. The effect of this construction is to group
together all of the strings in the formal derivative of sa, for example, and
place them in the sublanguage for $V_1$. We can construct the entire grammar G
by initially taking s to be the null string and U to be the start symbol.
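The construction can be sketched for a finite sample, treating the sample itself as the language L. This is an illustrative reading of the method, not Fu's algorithm: it assumes terminals are single characters, names nonterminals S, V1, V2, ... in order of discovery, and emits a production of the form U → a whenever a string may end after the terminal a. The function names are hypothetical.

```python
def derivative(language, prefix):
    """Formal derivative of `prefix`: every t such that prefix + t is in language."""
    return frozenset(w[len(prefix):] for w in language if w.startswith(prefix))


def infer_regular_grammar(sample):
    """Build productions with one nonterminal per distinct formal derivative."""
    start = derivative(sample, "")           # derivative of the null string
    names = {start: "S"}                     # derivative -> nonterminal name
    productions = set()
    pending = [start]
    while pending:
        current = pending.pop()
        # Try every terminal that can come next after the current prefix.
        for a in sorted({t[0] for t in current if t}):
            following = derivative(current, a)
            if "" in following:              # a sentence may end here: U -> a
                productions.add((names[current], (a,)))
            if following - {""}:             # more symbols may follow: U -> aV
                if following not in names:
                    names[following] = f"V{len(names)}"
                    pending.append(following)
                productions.add((names[current], (a, names[following])))
    return productions
```

For the sample {"ab", "ac"}, the sketch yields S → aV1, V1 → b, and V1 → c: the strings that follow a are grouped into the sublanguage of a single new nonterminal.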

The  chief  difficulty  of  distributional  methods  is  that  some  definition  of 
similar  contexts  is  needed  so  that  strings  that  appear  in  similar  contexts  can 
be  grouped  into  the  sublanguage  for  a  new  nonterminal  symbol.  Problems 
can also arise when one string is in two different sublanguages and therefore
appears  in  different  contexts.  The  word  program,  for  example,  can  be  both  a 
noun  and  a  verb. 

Another  approach  to  constructive  inference  of  grammars  is  to  look  for 
repetition  in  the  sample  and  model  it  as  a  recursive  production.  This  method 
is  rarely  sufficient  in  itself  to  construct  the  whole  grammar,  but  it  can  be  used 
in  combination  with  other  methods.  Consider,  for  example,  the  set  of  training 
instances {a, aaa, aaaa}. A reasonable grammar to infer has the productions
S → a and S → Sa and generates all possible strings of repeated a's.

To employ this repetition heuristic, it is helpful to know the properties of
repetition for different kinds of grammars. For regular grammars, iteration
always takes the form of repeated choice of a string without reference to
any other strings. However, for context-free languages, repetition can be
more complicated. One important theorem about context-free languages
(called the uvxyz theorem) states that any sufficiently long string in the
language can be written as uvxyz in such a way that the string $uv^kxy^kz$
is also in the language; that is, v and y are repeated an equal number of
times. This can be represented by a self-embedding production of the form
X → VXY. Solomonoff (1964) and Maryanski (1974) describe inference methods
based on searching for double cycles of the $uv^kxy^kz$ variety. Once a
possible cycle is found, it can be tested by consulting an informant.
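Testing a hypothesized double cycle against an informant can be sketched as follows. The target language used here (strings of a's followed by the same number of b's) and the function names are assumptions chosen purely for illustration; the informant is modeled as a membership function.

```python
def pump(u, v, x, y, z, k):
    """Build the string u v^k x y^k z for a hypothesized double cycle."""
    return u + v * k + x + y * k + z


def in_anbn(s):
    """Stand-in informant: accepts n a's followed by n b's (assumed language)."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n


def cycle_is_plausible(u, v, x, y, z, informant, max_k=4):
    """Ask the informant about the pumped strings for k = 0 .. max_k."""
    return all(informant(pump(u, v, x, y, z, k)) for k in range(max_k + 1))
```

With u = "a", v = "a", x = "", y = "b", z = "b" the informant accepts every pumped string, which supports a self-embedding production; pumping only v (taking y to be empty) is rejected as soon as k differs from 1. A finite number of queries can only support a hypothesis of this kind, not prove it.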

Refinement  Methods 

Refinement  methods  formulate  a  hypothesis  grammar  and  then  refine  it 
by  applying  simplification  heuristics  or  by  gathering  new  training  instances. 
Knobe and Knobe (1977), for example, present an algorithm that creates
an  initial  hypothesis  grammar,  G,  and  then  enters  a  refinement  cycle  in 
which  it  repeatedly  accepts  a  new  grammatical  string,  refines  G  to  include 
the  string,  and  generalizes  and  simplifies  G.  The  initial  grammar  includes  a 
distinct  nonterminal  for  each  of  the  terminal  symbols.  In  the  course  of  the 
algorithm,  these  nonterminals  are  generalized  by  merging.  The  basic  learning 
cycle  proceeds  as  follows: 

Step 1. Accept a grammatical string (i.e., a positive training instance) and
attempt to parse the string with the current grammar, G. If the
parse succeeds, repeat step 1; otherwise, go to step 2.

Step 2. Compute a list of partial parses and sort it according to generality.
(A partial parse is a string of terminals and nonterminals in which
parts of the original training string have been partly parsed into
nonterminals; the more general partial parses are shorter, since
most of the sentence has been successfully parsed.) Hypothesize
the production S -> P, where S is the start symbol and P is the
most general partial parse. (This allows the training instance to be
parsed successfully.) Use the modified grammar to generate a test
sentence, and ask the informant if the test sentence is grammatical.
If it is, go to step 3; otherwise, try the next most general partial
parse, and repeat until a sufficiently specific production has been
found.

Step 3. Generalize and simplify the grammar by applying some of the
merging  and  substitution  heuristics  described  below. 
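One pass through this learning cycle can be sketched as follows. This is a simplified illustration and not Knobe and Knobe's program. It assumes that productions are (left-hand side, right-hand side tuple) pairs with no empty right-hand sides, that partial parses are found by exhaustive bottom-up reduction, and that ties in length are broken by counting nonterminals. The test-sentence generator and the informant are hypothetical callables supplied by the caller, and step 3 is left as a pluggable function.

```python
from collections import deque


def partial_parses(sentence, grammar):
    """All forms reachable from `sentence` by replacing a right-hand
    side with its left-hand side, sorted most general first."""
    start = tuple(sentence)
    seen, queue = {start}, deque([start])
    while queue:
        form = queue.popleft()
        for lhs, rhs in grammar:
            n = len(rhs)
            for i in range(len(form) - n + 1):
                if form[i:i + n] == rhs:
                    reduced = form[:i] + (lhs,) + form[i + n:]
                    if reduced not in seen:
                        seen.add(reduced)
                        queue.append(reduced)
    nonterminals = {lhs for lhs, _ in grammar}
    # Shorter forms are more general; among equal lengths, prefer the
    # form with more nonterminals (an assumption of this sketch).
    return sorted(seen, key=lambda f: (len(f),
                                       -sum(s in nonterminals for s in f),
                                       f))


def refine(grammar, sentence, make_test_sentence, informant,
           generalize=lambda g: g):
    forms = partial_parses(sentence, grammar)
    if ("S",) in forms:                      # Step 1: the parse succeeds
        return grammar
    for p in forms:                          # Step 2: most general first
        candidate = grammar + [("S", p)]
        if informant(make_test_sentence(candidate, p)):
            return generalize(candidate)     # Step 3
    return grammar + [("S", tuple(sentence))]


def last_listed_expansion(grammar, form):
    """Crude hypothetical generator: expand each nonterminal once,
    using its last-listed production (assumed to yield terminals)."""
    table = {lhs: rhs for lhs, rhs in grammar if lhs != "S"}
    out = []
    for sym in form:
        out.extend(table.get(sym, (sym,)))
    return tuple(out)


grammar = [("N", ("dog",)), ("N", ("cat",)), ("V", ("runs",))]
acceptable = {("the", "dog", "runs"), ("the", "cat", "runs")}
grammar2 = refine(grammar, ("the", "dog", "runs"),
                  last_listed_expansion, lambda s: s in acceptable)
print(grammar2[-1])
```

In this toy run the most general partial parse is (the, N, V), the generated test sentence is "the cat runs", and the informant's approval leads to the new production S -> the N V.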

The third step of generalization and simplification is important, because
it is in this step that the new production S -> P is integrated into the grammar
and connected to existing production rules. Many different simplification and
generalization techniques have been developed by various researchers. We
survey a number of these here.

Generalization by disjunction. One important simplification technique
is to apply disjunction (see above) to replace two similar strings s and t,
which appear on the right-hand sides of productions, by a single nonterminal.
There are two basic heuristics for deciding whether s and t are similar:
internal similarity and external similarity. The internal-similarity heuristic
compares the sublanguages generated by s and t. If the sublanguages are similar,
the heuristic proposes that s and t are similar and should be disjoined. The
external-similarity heuristic, on the other hand, compares the contexts in
which s and t appear. As in the constructive technique of distributional
analysis, if s and t appear in similar contexts, the heuristic recommends that
they be disjoined. There are many important special cases of these heuristics:

1. Heuristics based on internal similarity. The first internal-similarity
heuristic is subsumption. If the language generated by s is a superset of the
language generated by t, then s and t should be disjoined. This often
occurs when s is a single nonterminal, X, and the rule X -> t is among
the productions for X in the grammar.

If  s  and  t  are  both  single  nonterminals,  X  and  Y,  a  second  internal 
heuristic  can  be  applied.  This  heuristic  compares  the  right-hand  sides, 
u and v, of production rules of the form X -> u and Y -> v, to see if
they  are  similar.  If  they  are,  X  and  Y  can  be  merged. 

A third internal-similarity heuristic is k-tail equivalence. Two strings s
and t are k-tail equivalent, for some nonnegative integer k, if the sets of
strings of length k or less that they generate are the same. Thus, s and
t are judged similar if the short strings that they generate are the same.
This heuristic can be applied by choosing a value for k and merging
groups of nonterminals that are k-tail equivalent. As k gets smaller, fewer
strings have to agree, so this heuristic causes more generalization.

2. Heuristics based on external similarity. The one heuristic for external
similarity is to look at the productions in which s and t appear on the
right-hand side. If s and t appear in similar contexts within
the productions, they can be disjoined. Various special cases of this
heuristic have been used, including the case in which s and t are both
single nonterminals.
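The k-tail equivalence test described under the internal-similarity heuristics can be sketched as follows. The sketch follows the definition given in this article (equal sets of generated strings of length k or less). The grammar representation (a dictionary from each nonterminal to a list of right-hand-side tuples), the restriction to grammars without empty productions, and the use of single-character terminals are assumptions made for the example.

```python
def short_strings(form, grammar, k):
    """Terminal strings of length <= k derivable from `form`.

    `form` is a tuple of symbols. Because no production is empty,
    any sentential form longer than k can be pruned.
    """
    form = tuple(form)
    if len(form) > k:
        return set()
    results, seen, stack = set(), {form}, [form]
    while stack:
        cur = stack.pop()
        # Expand the leftmost nonterminal, if there is one.
        idx = next((i for i, s in enumerate(cur) if s in grammar), None)
        if idx is None:
            results.add("".join(cur))
            continue
        for rhs in grammar[cur[idx]]:
            new = cur[:idx] + rhs + cur[idx + 1:]
            if len(new) <= k and new not in seen:
                seen.add(new)
                stack.append(new)
    return results


def k_tail_equivalent(s, t, grammar, k):
    """True if s and t generate the same strings of length <= k."""
    return short_strings(s, grammar, k) == short_strings(t, grammar, k)


# X generates a, aa, aaa, ...; Y generates a, aa, aaaa, ... (no aaa).
grammar = {
    "X": [("a",), ("a", "X")],
    "Y": [("a",), ("a", "a"), ("a", "a", "a", "Y")],
}
print(k_tail_equivalent(("X",), ("Y",), grammar, 2))
print(k_tail_equivalent(("X",), ("Y",), grammar, 3))
```

With k = 2 the two nonterminals are judged equivalent and would be merged; with k = 3 they are kept apart, which illustrates why smaller values of k lead to more generalization.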

Hypothesizing iteration. As with constructive methods, if productions
such as X -> a and X -> aa are present, a recursive production X -> Xa can
be introduced.
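A minimal sketch of this heuristic, assuming productions are stored as (left-hand side, right-hand side tuple) pairs; the function name is invented for the example:

```python
def hypothesize_iteration(productions):
    """If X -> s and X -> s s are both present, add X -> X s."""
    rules = set(productions)
    added = []
    for lhs, rhs in productions:
        recursive = (lhs, (lhs,) + rhs)
        if ((lhs, rhs + rhs) in rules
                and recursive not in rules
                and recursive not in added):
            added.append(recursive)
    return list(productions) + added


print(hypothesize_iteration([("X", ("a",)), ("X", ("a", "a"))]))
```

The sketch leaves X -> aa in place; once X -> Xa has been added, that production is redundant and could be removed in a later simplification step.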

Shorthand substitution. When a string s appears many times on the
right-hand side of productions, it is often good to create a new nonterminal,
A, replace all occurrences of s by A, and add the production A -> s to the
grammar. This simplifies the grammar without modifying the language that
it generates. The advantage of the simplification is that it is easier to apply
the various merging heuristics to a simplified grammar.
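Shorthand substitution is mechanical enough to sketch directly. The representation of productions as (left-hand side, right-hand side tuple) pairs and the function name are assumptions for the example; the string s is given as a nonempty tuple of symbols.

```python
def shorthand_substitute(productions, s, new_nonterminal):
    """Replace each occurrence of the symbol sequence `s` by
    `new_nonterminal` and add the production new_nonterminal -> s."""
    n = len(s)

    def replace(rhs):
        out, i = [], 0
        while i < len(rhs):
            if rhs[i:i + n] == s:
                out.append(new_nonterminal)
                i += n
            else:
                out.append(rhs[i])
                i += 1
        return tuple(out)

    return ([(lhs, replace(rhs)) for lhs, rhs in productions]
            + [(new_nonterminal, s)])


productions = [
    ("S", ("a", "b", "c")),
    ("S", ("d", "a", "b")),
    ("T", ("a", "b", "a", "b")),
]
for rule in shorthand_substitute(productions, ("a", "b"), "A"):
    print(rule)
```

After the substitution, the shared structure of the three productions is visible as the single nonterminal A, which is what makes the merging heuristics easier to apply.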

The k-tail heuristic was employed by Biermann and Feldman (1970) in the
inference of regular grammars. Various of the other heuristics are employed
by Klein and Kuppin (1970), Evans (1971), Knobe and Knobe (1977), and
Cook and Rosenfeld (1976). Cook and Rosenfeld are concerned with stochastic
grammars  and  use  their  heuristics  to  simplify  grammars  with  a  hill-climbing 
procedure  based  on  a  numerical-complexity  measure. 

Semantics-Based Methods

The  fourth  basic  approach  to  grammatical  inference  employs  semantic 
constraints  to  guide  the  search  for  plausible  grammars.  Most  of  this  work 
has  centered  on  language  acquisition  by  children.  The  child  is  given  positive 
examples  of  sentences  and  is  assumed  to  know  the  meanings  of  individual 
words  in  isolation.  Furthermore,  the  situation  in  which  the  sentence  was 
uttered,  and,  thus,  some  idea  about  its  overall  meaning,  is  assumed  to  be 
known by the child. In most work, no negative examples are provided,
nor is an informant available. This is because most research in psychology
(e.g., Brown and Hanlon, 1970) has found that children receive little or no
feedback concerning the grammaticality of the sentences they utter. Pinker
(1979) discusses the work of several researchers who have studied grammatical
inference under these assumptions, including Anderson (1977) and Hamburger
and Wexler (1975).

Anderson’s  Language  Acquisition  System  (LAS)  attempts  to  learn  a  context- 
free  grammar  for  English  from  training  instances  that  include  a  representation 
of  the  meaning  of  each  sentence.  The  Human  Associative  Memory  (HAM; 
Article  XI.E2)  network  notation  is  used  to  represent  these  sentence  meanings. 
Learning  proceeds  in  a  cycle  similar  to  that  of  Knobe  and  Knobe  (1977):  A 
sentence  and  its  meaning  are  input,  and  LAS  attempts  to  parse  the  sentence. 
If  the  parse  fails,  the  grammar  is  extended  according  to  some  refinement 
heuristics  so  that  the  training  sentence  can  be  parsed  and  assigned  the  correct 
meaning.  One  such  heuristic  adds  a  word  to  a  sublanguage — for  example,  it 
adds  chair  to  the  sublanguage  for  (noun) — when  the  word  is  located  at  a  place 
in  the  HAM  net  similar  to  the  place  of  other  words  in  the  sublanguage.  This 
is  a  special  case  of  the  general  heuristic  that  the  structure  of  the  semantic 
representation  is  reflected  in  the  structure  of  the  syntax  of  the  language.  A 
more sophisticated version of this heuristic is the graph deformation condition,
which states that branches in the HAM representation of the sample sentence
are not allowed to cross. This heuristic rules out certain parses that would
result in an ill-formed HAM structure. Anderson also employs one syntactic
heuristic:  Two  nonterminals  are  merged  if  they  have  similar  sublanguages. 

The  work  of  Hamburger  and  Wexler  (1975)  is  more  theoretical  in  nature 
and is concerned with showing that transformational grammars (see Chomsky,
1965)  are  learnable.  In  their  model,  the  learner  is  repeatedly  given  a  sentence 
and  its  meaning,  where  the  meaning  is  represented  as  a  deep-structure  parse 
tree  (based  on  a  deep-structure  context-free  grammar).  The  learner  must 
find  a  set  of  transformation  rules  that  succeed,  for  each  sample  sentence, 
in  converting  the  deep  structure  into  the  given  sentence.  Hamburger  and 
Wexler  are  proponents  of  Chomsky's  nativist  theory  of  language  acquisition, 
which  asserts  that  people  have  built-in  limits  and  biases  that  provide  essential 
constraints  for  the  language-learning  process.  Consequently,  their  model  of 
language  learning  includes  several  factors  that  limit  the  complexity  of  possible 
transformations. 

Given  these  limits,  Hamburger  and  Wexler  show  that  the  desired  set  of 
transformations  can  be  learned  by  a  program  as  follows.  As  each  training 
instance  (a  sentence  and  its  deep  structure)  is  received,  the  learner  tries  to 
transform  the  deep  structure  into  the  surface  sentence  by  applying  its  current 
set  of  transformations.  If  this  succeeds,  the  learner  goes  on  to  the  next  input 
example.  If  not,  the  learner  randomly  adds,  deletes,  or  alters  a  transformation 
and  goes  on.  This  method  will  work  as  long  as  the  learner  does  not  repeat 
transformation  rules  known  to  be  incorrect.  Plainly,  this  learning  procedure 
is  not  practical,  but  it  does  demonstrate  that  learning  transformation  rules 
under  these  assumptions  is  possible. 
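The flavor of this error-driven random search can be conveyed by a toy sketch. It is not Hamburger and Wexler's model: the three "transformations" are hypothetical tuple operations invented for the example, a hypothesis is simply a subset of that pool applied in a fixed order, and the learner keeps no record of rules already known to be incorrect.

```python
import random

# Hypothetical pool of candidate transformations on a deep structure,
# which is modeled here as a plain tuple of words.
POOL = {
    "swap_first_two": lambda t: (t[1], t[0]) + t[2:],
    "drop_last": lambda t: t[:-1],
    "reverse": lambda t: t[::-1],
}


def apply_rules(rules, deep):
    """Apply the chosen rules in a fixed (alphabetical) order."""
    for name in sorted(rules):
        deep = POOL[name](deep)
    return deep


def learn(examples, steps=1000, seed=0):
    """Keep the rule set while it works; change it at random when an
    example (deep structure, surface sentence) is not reproduced."""
    rng = random.Random(seed)
    rules = set()
    for _ in range(steps):
        deep, surface = rng.choice(examples)
        if apply_rules(rules, deep) == surface:
            continue                       # success: go on to next input
        action = rng.choice(["add", "delete", "alter"])
        if action in ("delete", "alter") and rules:
            rules.discard(rng.choice(sorted(rules)))
        if action in ("add", "alter"):
            rules.add(rng.choice(sorted(POOL)))
    return rules


examples = [
    (("a", "b", "c"), ("b", "a", "c")),
    (("x", "y", "z"), ("y", "x", "z")),
]
print(learn(examples))
```

Because a correct rule set is never changed again, the learner stays at the right answer once it stumbles on it; the search is plainly inefficient, which matches the point made above that the procedure is a demonstration of learnability rather than a practical method.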

Conclusion 

The  expressiveness  of  grammars  for  use  in  AI  knowledge  representation 
is somewhat limited, so interest in the difficult problem of grammatical
inference is correspondingly limited in the AI community. This is especially
so  because  of  the  impractical  nature  of  many  of  the  grammatical-inference 
systems  developed  thus  far.  However,  future  work  on  the  problem  may  yield 
more  powerful  inference  systems,  and  an  understanding  of  past  work  may  well 
be  helpful  in  research  on  related  learning  problems. 

References

We have surveyed here the motivations, limitations, and methods of grammatical
inference. More detailed surveys of grammatical inference in the context
of cognitive psychology are given in Pinker (1979) and Reeker (1976).
Surveys of grammatical inference for use in syntactic pattern recognition are
given in Fu (1974, 1975), Biermann and Feldman (1972), and Gonzalez and
Thompson (1978).